Using Spark UDFs with struct sequences - scala

Given a DataFrame in which one column is a sequence of structs, generated by the following code
val df = spark
  .range(10)
  .map(i => (i % 2, util.Random.nextInt(10), util.Random.nextInt(10)))
  .toDF("a", "b", "c")
  .groupBy("a")
  .agg(collect_list(struct($"b", $"c")).as("my_list"))
df.printSchema
df.show(false)
Outputs
root
|-- a: long (nullable = false)
|-- my_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b: integer (nullable = false)
| | |-- c: integer (nullable = false)
+---+-----------------------------------+
|a |my_list |
+---+-----------------------------------+
|0 |[[0,3], [9,5], [3,1], [4,2], [3,3]]|
|1 |[[1,7], [4,6], [5,9], [6,4], [3,9]]|
+---+-----------------------------------+
I need to run a function over each struct list. The function prototype is similar to the function below
case class DataPoint(b: Int, c: Int)
def do_something_with_data(data: Seq[DataPoint]): Double = {
  // This is an example. I don't actually want the sum
  data.map(data_point => data_point.b + data_point.c).sum
}
I want to store the result of this function to another DataFrame column.
I tried to run
val my_udf = udf(do_something_with_data(_))
val df_with_result = df.withColumn("result", my_udf($"my_list"))
df_with_result.show(false)
and got
17/07/13 12:33:42 WARN TaskSetManager: Lost task 0.0 in stage 15.0 (TID 225, REDACTED, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<struct<b:int,c:int>>) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line27.$read$$iw$$iw$DataPoint
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$do_something_with_data$1.apply(<console>:29)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.do_something_with_data(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
Is it possible to use a UDF like this without first casting my rows to a container struct with the DataFrame API?
Doing something like:
case class MyRow(a: Long, my_list: Seq[DataPoint])
df.as[MyRow].map(r => (r.a, r.my_list, do_something_with_data(r.my_list)))
using the Dataset API works, but I'd prefer to stick with the DataFrame API if possible.

You cannot use a case class as the input argument of your UDF (but you can return case classes from the UDF). To map an array of structs, you can pass a Seq[Row] to your UDF:
val my_udf = udf((data: Seq[Row]) => {
  // This is an example. I don't actually want the sum
  data.map { case Row(x: Int, y: Int) => x + y }.sum
})

df.withColumn("result", my_udf($"my_list")).show
+---+--------------------+------+
| a| my_list|result|
+---+--------------------+------+
| 0|[[0,3], [5,5], [3...| 41|
| 1|[[0,9], [4,9], [6...| 54|
+---+--------------------+------+
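If you want to keep the original do_something_with_data signature, one option (a minimal sketch, not the only way; my_udf_typed is just an illustrative name) is to rebuild the DataPoint instances from the Rows inside the UDF and then delegate to the existing function:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Rebuild the case class instances from the Rows, then reuse the original function.
// Field access by name assumes the struct fields are still called "b" and "c".
val my_udf_typed = udf((data: Seq[Row]) =>
  do_something_with_data(data.map(r => DataPoint(r.getAs[Int]("b"), r.getAs[Int]("c"))))
)

df.withColumn("result", my_udf_typed($"my_list")).show(false)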

Related

SPARK java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am trying to use the agg function with a type-safe check. I created a case class for the dataset and defined its schema:
case class JosmSalesRecord(orderId: Long,
                           totalOrderSales: Double,
                           totalOrderCount: Long)

object JosmSalesRecord extends SparkSessionWrapper {
  import sparkSession.implicits._

  val schema: Option[StructType] = Some(StructType(Seq(
    StructField("order_id", IntegerType, true),
    StructField("total_order_sales", DoubleType, true),
    StructField("total_order_count", IntegerType, true)
  )))
}
DataSet
+----------+------------------+---------------+
| orderId| totalOrderSales|totalOrderCount|
+----------+------------------+---------------+
|1071131089| 433.68| 8|
|1071386263| 8848.42000000001| 343|
|1071439146|108.39999999999999| 8|
|1071349454|34950.400000000074| 512|
|1071283654|349.65000000000003| 27|
root
|-- orderId: long (nullable = false)
|-- totalOrderSales: double (nullable = false)
|-- totalOrderCount: long (nullable = false)
I am applying the following function on the dataset:
val pv = aggregateSum.agg(typed.sum[JosmSalesRecord](_.totalOrderSales),
typed.sumLong[JosmSalesRecord](_.totalOrderCount)).toDF("totalOrderSales", "totalOrderCount")
When applying pv.show(), I am getting the following exception:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to com.pipeline.dataset.JosmSalesRecord
at com.FirstApp$$anonfun$7.apply(FirstApp.scala:78)
at org.apache.spark.sql.execution.aggregate.TypedSumDouble.reduce(typedaggregators.scala:32)
at org.apache.spark.sql.execution.aggregate.TypedSumDouble.reduce(typedaggregators.scala:30)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doConsume_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doAggregateWithKeysOutput_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Note:
I strictly want to use the type-safe function import org.apache.spark.sql.expressions.scalalang.typed.sum. I do get the correct answer when I use the sum from import org.apache.spark.sql.functions.
If aggregateSum is a DataFrame, e.g.
val aggregateSum = Seq((1071131089L, 433.68, 8),(1071386263L,8848.42000000001,343)).toDF("orderId", "totalOrderSales", "totalOrderCount")
In current versions of Spark (i.e. 3.x), typed is deprecated. You can still use type safety with something like this (Spark 3):
case class OutputRecord(totalOrderSales: Double, totalOrderCount: Long)

val pv = aggregateSum.as[JosmSalesRecord]
  .groupByKey(_ => 1)
  .mapGroups((k, vs) => vs.map((x: JosmSalesRecord) => (x.totalOrderSales, x.totalOrderCount))
    .reduceOption((x, y) => (x._1 + y._1, x._2 + y._2))
    .map { case (x, y) => OutputRecord(x, y) }
    .getOrElse(OutputRecord(0.0, 0L)))
pv.show()
gives
+----------------+---------------+
| totalOrderSales|totalOrderCount|
+----------------+---------------+
|9282.10000000001| 351|
+----------------+---------------+
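Another type-safe route in Spark 3, not shown in the original answer, is the Aggregator API, which is the documented replacement for the deprecated typed aggregators. A minimal sketch, reusing the JosmSalesRecord case class from the question:
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Sums (totalOrderSales, totalOrderCount) across the whole Dataset in a type-safe way.
val salesAndCount = new Aggregator[JosmSalesRecord, (Double, Long), (Double, Long)] {
  def zero: (Double, Long) = (0.0, 0L)
  def reduce(b: (Double, Long), a: JosmSalesRecord): (Double, Long) =
    (b._1 + a.totalOrderSales, b._2 + a.totalOrderCount)
  def merge(b1: (Double, Long), b2: (Double, Long)): (Double, Long) =
    (b1._1 + b2._1, b1._2 + b2._2)
  def finish(reduction: (Double, Long)): (Double, Long) = reduction
  def bufferEncoder: Encoder[(Double, Long)] = Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
  def outputEncoder: Encoder[(Double, Long)] = Encoders.tuple(Encoders.scalaDouble, Encoders.scalaLong)
}

val totals = aggregateSum.as[JosmSalesRecord].select(salesAndCount.toColumn)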
If you have Scala cats as a dependency already, then you can also do something like this
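That cats snippet is not included above; here is a minimal sketch of what it could look like, assuming cats.implicits._ is on the classpath (the tuple Semigroup's |+| replaces the manual pairwise addition):
import cats.implicits._  // provides Semigroup[(Double, Long)] and the |+| syntax

val pv = aggregateSum.as[JosmSalesRecord]
  .groupByKey(_ => 1)
  .mapGroups((_, vs) => vs.map(x => (x.totalOrderSales, x.totalOrderCount))
    .reduceOption(_ |+| _)                       // element-wise tuple combine
    .map { case (sales, count) => OutputRecord(sales, count) }
    .getOrElse(OutputRecord(0.0, 0L)))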

How to create Spark DataFrame from RDD[Row] when Row contains Map[Map]

This question is the continuation of this other one, where the user who gave the valid answer requested a new question to explain my further doubts.
What I am trying to do is generate a DataFrame from an RDD[Objects], where my objects have primitive types but also complex types. The previous question explained how to parse a complex Map type.
What I tried next was to extrapolate that solution to parse a Map[Map], so in the DataFrame it is converted into an Array(Map).
Below I give the code I have written so far:
//I get an Object from Hbase here
val objectRDD: RDD[HbaseRecord] = ...

//I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD: RDD[Row] = objectRDD.map(
  hbaseRecord => {
    val uuid: String = hbaseRecord.uuid
    val timestamp: String = hbaseRecord.timestamp

    val name = Row(hbaseRecord.nameMap.firstName.getOrElse(""),
                   hbaseRecord.nameMap.middleName.getOrElse(""),
                   hbaseRecord.nameMap.lastName.getOrElse(""))

    val contactsMap = hbaseRecord.contactsMap

    val homeContactMap = contactsMap.get("HOME")
    val homeContact = Row(homeContactMap.contactType,
                          homeContactMap.areaCode,
                          homeContactMap.number)

    val workContactMap = contactsMap.get("WORK")
    val workContact = Row(workContactMap.contactType,
                          workContactMap.areaCode,
                          workContactMap.number)

    val contacts = Row(homeContact, workContact)

    Row(uuid, timestamp, name, contacts)
  }
)
//Here I define the schema
val schema = new StructType()
  .add("uuid", StringType)
  .add("timestamp", StringType)
  .add("name", new StructType()
    .add("firstName", StringType)
    .add("middleName", StringType)
    .add("lastName", StringType))
  .add("contacts", new StructType(
    Array(
      StructField("contactType", StringType),
      StructField("areaCode", StringType),
      StructField("number", StringType)
    )))
//Now I try to create a Dataframe using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD , schema)
But I am getting the following error:
19/03/18 12:09:53 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 8)
scala.MatchError: [HOME,05,12345678] (of class org.apache.spark.sql.catalyst.expressions.GenericRow)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I tried as well to generate the contacts element as an array:
val contacts = Array(homeContact,workContact)
But then I get the following error instead:
scala.MatchError: [Lorg.apache.spark.sql.Row;@726c6aec (of class [Lorg.apache.spark.sql.Row;)
Can anyone spot the problem?
Let's simplify your situation to your array of contacts. That's where the problem is. You are trying to use this schema:
val schema = new StructType()
  .add("contacts", new StructType(
    Array(
      StructField("contactType", StringType),
      StructField("areaCode", StringType),
      StructField("number", StringType)
    )))
to store a list of contacts. Yet, as written, this schema describes a single struct, so it can hold just one contact, not a list of them. We can verify it with:
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema).printSchema
root
|-- contacts: struct (nullable = true)
| |-- contactType: string (nullable = true)
| |-- areaCode: string (nullable = true)
| |-- number: string (nullable = true)
Indeed, the Array you have in your code is just meant to contain the fields of your "contacts" struct type.
To achieve what you want, a type exists: ArrayType. This yields a slightly different result:
val schema_ok = new StructType()
  .add("contacts", ArrayType(new StructType(Array(
    StructField("contactType", StringType),
    StructField("areaCode", StringType),
    StructField("number", StringType)))))
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema_ok).printSchema
root
|-- contacts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- contactType: string (nullable = true)
| | |-- areaCode: string (nullable = true)
| | |-- number: string (nullable = true)
and it works:
val row = Row(Array(
Row("type", "code", "number"),
Row("type2", "code2", "number2")))
spark.createDataFrame(sc.parallelize(Seq(row)), schema_ok).show(false)
+-------------------------------------------+
|contacts |
+-------------------------------------------+
|[[type,code,number], [type2,code2,number2]]|
+-------------------------------------------+
So if you update the schema with this version of "contacts", just replace val contacts = Row(homeContact,workContact) with val contacts = Array(homeContact,workContact) and it should work.
NB: if you want to label your contacts (with HOME or WORK), there is a MapType as well.
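For completeness, a minimal sketch of that MapType variant, not part of the original answer and assuming the same contact fields; printSchema should show something along these lines:
val schema_map = new StructType()
  .add("contacts", MapType(StringType, new StructType(Array(
    StructField("contactType", StringType),
    StructField("areaCode", StringType),
    StructField("number", StringType)))))

val row_map = Row(Map(
  "HOME" -> Row("type", "code", "number"),
  "WORK" -> Row("type2", "code2", "number2")))

spark.createDataFrame(sc.parallelize(Seq(row_map)), schema_map).printSchema
root
|-- contacts: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- contactType: string (nullable = true)
| | |-- areaCode: string (nullable = true)
| | |-- number: string (nullable = true)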

Spark UDF to split a column value to multiple columns

I have a DataFrame column called 'description' with values in the below format
ABC XXXXXXXXXXXX STORE NAME ABC TYPE1
I would like to parse it into 3 different columns like below
| mode | type  | store      | description                           |
|------|-------|------------|---------------------------------------|
| ABC  | TYPE1 | STORE NAME | ABC XXXXXXXXXXXX STORE NAME ABC TYPE1 |
I tried the method suggested in the link here. It works for a simple UDF function but not for the function I have written. The challenge is that the value of store could be more than 2 words, with no fixed number of words.
def myFunc1: (String => (String, String, String)) = { description =>
  var descripe = description.split(" ")
  val `type` = descripe(descripe.size - 1)
  descripe = description.substring(description.indexOf("ABC") + 4, description.lastIndexOf("ABC")).split(" ")
  val mode = descripe(0)
  descripe(0) = ""
  val store = descripe.mkString(" ").trim
  (mode, store, `type`)
}

val schema = StructType(Array(
  StructField("mode", StringType, true),
  StructField("store", StringType, true),
  StructField("type", StringType, true)
))
val myUDF = udf(myFunc1, schema)
val test = pos.withColumn("test", myUDF(col("description")))
test.printSchema()
val a =test.withColumn("mode", col("test").getItem("_1"))
.withColumn("store", col("test").getItem("_2"))
.withColumn("type", col("test").getItem("_3"))
.drop(col("test"))
a.printSchema()
a.show(5, false)
I get the below error when I execute
18/10/06 21:38:02 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$myFunc1$1$1: (string) => struct(mode:string,store:string,type:string))
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.StringIndexOutOfBoundsException: String index out of range: -4
at java.lang.String.substring(String.java:1967)
at com.hasif.bank.track.trasaction.TransactionParser$$anonfun$myFunc1$1$1.apply(TransactionParser.scala:26)
at com.hasif.bank.track.trasaction.TransactionParser$$anonfun$myFunc1$1$1.apply(TransactionParser.scala:22)
... 16 more
Any pointers on this will be appreciated.
Check this out.
scala> val df = Seq("ABC XXXXXXXXXXXX STORE NAME ABC TYPE1").toDF("desc")
df: org.apache.spark.sql.DataFrame = [desc: string]
scala> df.withColumn("mode",split('desc," ")(0)).withColumn("type",split('desc," ")(5)).withColumn("store",concat(split('desc," ")(2), lit(" "), split('desc," ")(3))).show(false)
+-------------------------------------+----+-----+----------+
|desc |mode|type |store |
+-------------------------------------+----+-----+----------+
|ABC XXXXXXXXXXXX STORE NAME ABC TYPE1|ABC |TYPE1|STORE NAME|
+-------------------------------------+----+-----+----------+
scala>
Update1:
scala> def splitStore(x:String):String=
| return x.split(" ").drop(2).init.init.mkString(" ")
splitStore: (x: String)String
scala> val mysplitstore = udf(splitStore(_:String):String)
mysplitstore: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> val df2 = Seq("ABC XXXXXXXXXXXX STORE NAME XYZ ABC TYPE1").toDF("desc")
df2: org.apache.spark.sql.DataFrame = [desc: string]
scala> val df3 = df2.withColumn("length",split('desc," "))
df3: org.apache.spark.sql.DataFrame = [desc: string, length: array<string>]
scala> val df4 = df3.withColumn("mode",split('desc," ")(size('length)-2)).withColumn("type",split('desc," ")(size('length)-1)).withColumn("store",mysplitstore('desc))
df4: org.apache.spark.sql.DataFrame = [desc: string, length: array<string> ... 3 more fields]
scala> df4.drop('length).show(false)
+-----------------------------------------+----+-----+--------------+
|desc |mode|type |store |
+-----------------------------------------+----+-----+--------------+
|ABC XXXXXXXXXXXX STORE NAME XYZ ABC TYPE1|ABC |TYPE1|STORE NAME XYZ|
+-----------------------------------------+----+-----+--------------+
scala>
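Not part of the original answer: on Spark 2.4+ the same slicing can be done with built-in functions only. A sketch that assumes the same token layout as df2 above (first token = mode, last token = type, store = everything between the third token and the third-from-last token):
import org.apache.spark.sql.functions._

val tokens = split(col("desc"), " ")

df2.withColumn("mode", tokens(0))
  .withColumn("type", tokens(size(tokens) - 1))
  // store = tokens 3 .. size-2 (1-based), i.e. drop the first two and the last two tokens
  .withColumn("store", array_join(expr("slice(split(desc, ' '), 3, size(split(desc, ' ')) - 4)"), " "))
  .show(false)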

Convert spark dataframe map column to json

I have a DataFrame with the below schema and a sample record
root
|-- name: string (nullable = true)
|-- matches: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
+---------------+------------------------------------------------------------------------------------------+
|name |matches |
+---------------+------------------------------------------------------------------------------------------+
|CVS_Extra |Map(MLauer -> 1, MichaelBColeman -> 1, OhioFoodbanks -> 1, 700wlw -> 1, cityofdayton -> 1)|
I am trying to convert the map type column to JSON using the below code (json4s library):
val d = countDF.map( row => (row(0),convertMapToJSON(row(1).asInstanceOf[Map[String, Int]]).toString()))
But it fails with
java.lang.ClassNotFoundException: scala.Any
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1210)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1202)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:50)
at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:45)
at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1202)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
Scala Version - 2.11, json4s-jackson_2.11 & spark 2.2.0
Can anyone please suggest how to overcome this error. Thanks in advance.
Your code fails because you use the apply method incorrectly. You should use, for example:
countDF.map(row =>
  (row.getString(0), convertMapToJSON(row.getMap[String, Int](1)).toString())
)
For more see Spark extracting values from a Row.
But all you need is select / withColumn with to_json:
import org.apache.spark.sql.functions.to_json
countDF.withColumn("matches", to_json($"matches"))
and if your function uses more complex logic use udf
import org.apache.spark.sql.functions.udf
val convert_map_to_json = udf(
(map: Map[String, Int]) => convertMapToJSON(map).toString
)
countDF.withColumn("matches", convert_map_to_json($"matches"))

Spark: convert rdd[row] to dataframe where one of the columns in the row is a list

I have an RDD[Row] with the following data for each row
[guid, List(peopleObjects)]
["123", List(peopleObjects1, peopleObjects2, peopleObjects3)]
I want to convert this to a dataframe
I am using the following code
val personStructureType = new StructType()
.add(StructField("guid", StringType, true))
.add(StructField("personList", StringType, true))
val personDF = hiveContext.createDataFrame(personRDD, personStructureType)
Should I be using a different datatype for my schema instead of StringType?
If my list is just a string it works, but when it's a List I get the following error
scala.MatchError: List(personObject1, personObject2, personObject3) (of class scala.collection.immutable.$colon$colon)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
at org.apache.spark.sql.SQLContext$$anonfun$7.apply(SQLContext.scala:445)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:219)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
It's not entirely clear what you are trying to do, but a better approach is to create a case class, map your RDD rows to the case class, and then call toDF.
Something like:
case class MyClass(guid: Int, peopleObjects: List[String])
val rdd = sc.parallelize(Array((123,List("a","b")),(1232,List("b","d"))))
val df = rdd.map(r => MyClass(r._1, r._2)).toDF
df.show
+----+-------------+
|guid|peopleObjects|
+----+-------------+
| 123| [a, b]|
|1232| [b, d]|
+----+-------------+
Or you can do it the long-hand way, but without using the case class, like this:
val df = sqlContext.createDataFrame(
  rdd.map(r => Row(r._1, r._2)),
  StructType(Array(
    StructField("guid", IntegerType),
    StructField("peopleObjects", ArrayType(StringType))
  ))
)
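For reference (not shown in the original answer), the long-hand version should produce a schema along these lines:
df.printSchema

root
|-- guid: integer (nullable = true)
|-- peopleObjects: array (nullable = true)
| |-- element: string (containsNull = true)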