Convert spark dataframe map column to json - scala

I have a dataframe with the below schema and sample record:
root
|-- name: string (nullable = true)
|-- matches: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
+---------------+------------------------------------------------------------------------------------------+
|name |matches |
+---------------+------------------------------------------------------------------------------------------+
|CVS_Extra |Map(MLauer -> 1, MichaelBColeman -> 1, OhioFoodbanks -> 1, 700wlw -> 1, cityofdayton -> 1)|
I am trying to convert the map type column to JSON using the code below (json4s library):
val d = countDF.map( row => (row(0),convertMapToJSON(row(1).asInstanceOf[Map[String, Int]]).toString()))
But it fails with:
java.lang.ClassNotFoundException: scala.Any
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1210)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1202)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:50)
at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:45)
at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1202)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
Scala Version - 2.11, json4s-jackson_2.11 & spark 2.2.0
Can anyone please suggest how to overcome this error? Thanks in advance.

Your code fails because you use Row's apply method incorrectly: row(0) and row(1) are typed as Any, for which Spark cannot find an encoder. Use the typed getters instead, for example:
countDF.map(row =>
  (row.getString(0), convertMapToJSON(row.getMap[String, Int](1)).toString())
)
For more see Spark extracting values from a Row.
But all you need is select / withColumn with to_json:
import org.apache.spark.sql.functions.to_json
countDF.withColumn("matches", to_json($"matches"))
If your function uses more complex logic, use a udf:
import org.apache.spark.sql.functions.udf
val convert_map_to_json = udf(
  (map: Map[String, Int]) => convertMapToJSON(map).toString
)
countDF.withColumn("matches", convert_map_to_json($"matches"))

Related

SPARK java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast

I am trying to use the agg function with a type-safe check. I created a case class for the dataset and defined its schema:
case class JosmSalesRecord(orderId: Long,
                           totalOrderSales: Double,
                           totalOrderCount: Long)

object JosmSalesRecord extends SparkSessionWrapper {
  import sparkSession.implicits._

  val schema: Option[StructType] = Some(StructType(Seq(
    StructField("order_id", IntegerType, true),
    StructField("total_order_sales", DoubleType, true),
    StructField("total_order_count", IntegerType, true)
  )))
}
DataSet
+----------+------------------+---------------+
| orderId| totalOrderSales|totalOrderCount|
+----------+------------------+---------------+
|1071131089| 433.68| 8|
|1071386263| 8848.42000000001| 343|
|1071439146|108.39999999999999| 8|
|1071349454|34950.400000000074| 512|
|1071283654|349.65000000000003| 27|
root
|-- orderId: long (nullable = false)
|-- totalOrderSales: double (nullable = false)
|-- totalOrderCount: long (nullable = false)
I am applying the following function on the dataset:
val pv = aggregateSum.agg(typed.sum[JosmSalesRecord](_.totalOrderSales),
typed.sumLong[JosmSalesRecord](_.totalOrderCount)).toDF("totalOrderSales", "totalOrderCount")
When applying pv.show(), I am getting the following:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to com.pipeline.dataset.JosmSalesRecord
at com.FirstApp$$anonfun$7.apply(FirstApp.scala:78)
at org.apache.spark.sql.execution.aggregate.TypedSumDouble.reduce(typedaggregators.scala:32)
at org.apache.spark.sql.execution.aggregate.TypedSumDouble.reduce(typedaggregators.scala:30)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doConsume_1$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.serializefromobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.mapelements_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.deserializetoobject_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doAggregateWithKeysOutput_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.agg_doAggregateWithoutKey_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage9.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
Note: I strictly want to use the type-safe import org.apache.spark.sql.expressions.scalalang.typed.sum function. I do get the correct answer when applying the sum from org.apache.spark.sql.functions.
If aggregateSum is a DataFrame, e.g.
val aggregateSum = Seq((1071131089L, 433.68, 8),(1071386263L,8848.42000000001,343)).toDF("orderId", "totalOrderSales", "totalOrderCount")
then typed.sum[JosmSalesRecord] is applied to untyped Rows (GenericRowWithSchema) rather than JosmSalesRecord instances, which is what causes the ClassCastException. In current versions of Spark (i.e. 3.x), typed is deprecated anyway. You can still use type safety with something like this (Spark 3):
case class OutputRecord(totalOrderSales: Double, totalOrderCount: Long)

val pv = aggregateSum.as[JosmSalesRecord]
  .groupByKey(_ => 1)
  .mapGroups((k, vs) => vs
    .map((x: JosmSalesRecord) => (x.totalOrderSales, x.totalOrderCount))
    .reduceOption((x, y) => (x._1 + y._1, x._2 + y._2))
    .map { case (x, y) => OutputRecord(x, y) }
    .getOrElse(OutputRecord(0.0, 0L)))
pv.show()
gives
+----------------+---------------+
| totalOrderSales|totalOrderCount|
+----------------+---------------+
|9282.10000000001| 351|
+----------------+---------------+
If you have Scala cats as a dependency already, then you can also do something like this
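A minimal sketch of that variant (assuming import cats.implicits._ is available, so the (Double, Long) tuples can be combined element-wise with the |+| Semigroup operator):
import cats.implicits._

val pvCats = aggregateSum.as[JosmSalesRecord]
  .groupByKey(_ => 1)
  .mapGroups((_, vs) => vs
    .map(x => (x.totalOrderSales, x.totalOrderCount))
    .reduceOption(_ |+| _)  // combines the tuples element-wise
    .map { case (sales, count) => OutputRecord(sales, count) }
    .getOrElse(OutputRecord(0.0, 0L)))
pvCats.show()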

How to create Spark DataFrame from RDD[Row] when Row contains Map[Map]

This question is the continuation of this other one, where the user who gave the valid answer requested a new question to explain my further doubts.
What I am trying to do is generate a dataframe from an RDD of objects that have primitive types but also complex types. In the previous question it was explained how to parse a complex Map type.
What I tried next was to extrapolate the given solution to parse a Map[Map], so that in the DataFrame it is converted into an Array(Map).
Below is the code I have written so far:
// I get an Object from Hbase here
val objectRDD: RDD[HbaseRecord] = ...

// I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD: RDD[Row] = objectRDD.map(
  hbaseRecord => {
    val uuid: String = hbaseRecord.uuid
    val timestamp: String = hbaseRecord.timestamp

    val name = Row(hbaseRecord.nameMap.firstName.getOrElse(""),
      hbaseRecord.nameMap.middleName.getOrElse(""),
      hbaseRecord.nameMap.lastName.getOrElse(""))

    val contactsMap = hbaseRecord.contactsMap

    val homeContactMap = contactsMap.get("HOME")
    val homeContact = Row(homeContactMap.contactType,
      homeContactMap.areaCode,
      homeContactMap.number)

    val workContactMap = contactsMap.get("WORK")
    val workContact = Row(workContactMap.contactType,
      workContactMap.areaCode,
      workContactMap.number)

    val contacts = Row(homeContact, workContact)

    Row(uuid, timestamp, name, contacts)
  }
)
// Here I define the schema
val schema = new StructType()
  .add("uuid", StringType)
  .add("timestamp", StringType)
  .add("name", new StructType()
    .add("firstName", StringType)
    .add("middleName", StringType)
    .add("lastName", StringType))
  .add("contacts", new StructType(
    Array(
      StructField("contactType", StringType),
      StructField("areaCode", StringType),
      StructField("number", StringType)
    )))

// Now I try to create a Dataframe using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
But I am getting the following error:
19/03/18 12:09:53 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 8)
scala.MatchError: [HOME,05,12345678] (of class org.apache.spark.sql.catalyst.expressions.GenericRow)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I also tried to generate the contacts element as an array:
val contacts = Array(homeContact,workContact)
But then I get the following error instead:
scala.MatchError: [Lorg.apache.spark.sql.Row;@726c6aec (of class [Lorg.apache.spark.sql.Row;)
Can anyone spot the problem?
Let's simplify your situation to your array of contacts. That's where the problem is. You are trying to use this schema:
val schema = new StructType()
  .add("contacts", new StructType(
    Array(
      StructField("contactType", StringType),
      StructField("areaCode", StringType),
      StructField("number", StringType)
    )))
to store a list of contacts, which is a struct type. Yet, this schema cannot contain a list, just one contact. We can verify it with:
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema).printSchema
root
|-- contacts: struct (nullable = true)
| |-- contactType: string (nullable = true)
| |-- areaCode: string (nullable = true)
| |-- number: string (nullable = true)
Indeed, the Array you have in your code is just meant to contain the fields of your "contacts" struct type.
To achieve what you want, a type exists: ArrayType. This yields a slightly different result:
val schema_ok = new StructType()
  .add("contacts", ArrayType(new StructType(Array(
    StructField("contactType", StringType),
    StructField("areaCode", StringType),
    StructField("number", StringType)))))
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema_ok).printSchema
root
|-- contacts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- contactType: string (nullable = true)
| | |-- areaCode: string (nullable = true)
| | |-- number: string (nullable = true)
and it works:
val row = Row(Array(
Row("type", "code", "number"),
Row("type2", "code2", "number2")))
spark.createDataFrame(sc.parallelize(Seq(row)), schema_ok).show(false)
+-------------------------------------------+
|contacts |
+-------------------------------------------+
|[[type,code,number], [type2,code2,number2]]|
+-------------------------------------------+
So if you update the schema with this version of "contacts", just replace val contacts = Row(homeContact, workContact) with val contacts = Array(homeContact, workContact) and it should work.
NB: if you want to label your contacts (with HOME or WORK), there is a MapType as well.
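A minimal sketch of that labelled variant (contactStruct and the sample values here are illustrative, not from the original post):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val contactStruct = new StructType(Array(
  StructField("contactType", StringType),
  StructField("areaCode", StringType),
  StructField("number", StringType)))

val schema_map = new StructType()
  .add("contacts", MapType(StringType, contactStruct))

// Each row carries a map from label ("HOME", "WORK") to a contact Row
val row_map = Row(Map(
  "HOME" -> Row("type", "code", "number"),
  "WORK" -> Row("type2", "code2", "number2")))

spark.createDataFrame(sc.parallelize(Seq(row_map)), schema_map).show(false)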

Using Spark UDFs with struct sequences

Given a dataframe in which one column is a sequence of structs, generated by the following code
import org.apache.spark.sql.functions.{collect_list, struct}
import spark.implicits._

val df = spark
  .range(10)
  .map((i) => (i % 2, util.Random.nextInt(10), util.Random.nextInt(10)))
  .toDF("a", "b", "c")
  .groupBy("a")
  .agg(collect_list(struct($"b", $"c")).as("my_list"))

df.printSchema
df.show(false)
Outputs
root
|-- a: long (nullable = false)
|-- my_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b: integer (nullable = false)
| | |-- c: integer (nullable = false)
+---+-----------------------------------+
|a |my_list |
+---+-----------------------------------+
|0 |[[0,3], [9,5], [3,1], [4,2], [3,3]]|
|1 |[[1,7], [4,6], [5,9], [6,4], [3,9]]|
+---+-----------------------------------+
I need to run a function over each struct list. The function prototype is similar to the function below
case class DataPoint(b: Int, c: Int)

def do_something_with_data(data: Seq[DataPoint]): Double = {
  // This is an example. I don't actually want the sum
  data.map(data_point => data_point.b + data_point.c).sum
}
I want to store the result of this function to another DataFrame column.
I tried to run
val my_udf = udf(do_something_with_data(_))
val df_with_result = df.withColumn("result", my_udf($"my_list"))
df_with_result.show(false)
and got
17/07/13 12:33:42 WARN TaskSetManager: Lost task 0.0 in stage 15.0 (TID 225, REDACTED, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<struct<b:int,c:int>>) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line27.$read$$iw$$iw$DataPoint
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$do_something_with_data$1.apply(<console>:29)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.do_something_with_data(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
Is it possible to use a UDF like this with the DataFrame API, without first casting my rows to a container class?
Doing something like:
case class MyRow(a: Long, my_list: Seq[DataPoint])
df.as[MyRow].map(r => (r.a, r.my_list, do_something_with_data(r.my_list)))
using the Dataset API works, but I'd prefer to stick with the DataFrame API if possible.
You cannot use a case class as the input argument of your UDF (but you can return case classes from a UDF). To map an array of structs, you can pass a Seq[Row] to your UDF:
import org.apache.spark.sql.Row

val my_udf = udf((data: Seq[Row]) => {
  // This is an example. I don't actually want the sum
  data.map { case Row(x: Int, y: Int) => x + y }.sum
})

df.withColumn("result", my_udf($"my_list")).show
+---+--------------------+------+
| a| my_list|result|
+---+--------------------+------+
| 0|[[0,3], [5,5], [3...| 41|
| 1|[[0,9], [4,9], [6...| 54|
+---+--------------------+------+
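As noted above, returning a case class from the UDF is fine; a small illustrative sketch (SumResult is a made-up name, not from the original post):
case class SumResult(total: Int)

val sum_to_struct = udf((data: Seq[Row]) =>
  SumResult(data.map { case Row(b: Int, c: Int) => b + c }.sum))

// The "result" column becomes a struct<total: int>
df.withColumn("result", sum_to_struct($"my_list")).printSchema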

Spark 2.1: Convert RDD to Dataset with custom columns using toDS() function

I want to transform an RDD into a Dataset with custom columns using the Spark SQL native function toDS().
I don't have any errors at compile time, but at runtime I get the error No Encoder found for java.time.LocalDate.
Below is the full stack trace:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for java.time.LocalDate
- field (class: "java.time.LocalDate", name: "_1")
- root class: "scala.Tuple3"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:602)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)
at observatory.Extraction$.locationYearlyAverageRecords(Extraction.scala:114)
at observatory.Extraction$.processExtraction(Extraction.scala:28)
at observatory.Main$.delayedEndpoint$observatory$Main$1(Main.scala:18)
at observatory.Main$delayedInit$body.apply(Main.scala:7)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at observatory.Main$.main(Main.scala:7)
at observatory.Main.main(Main.scala)
My RDD is composed of three columns, based on a Tuple3 with this signature:
type TemperatureRecord = (LocalDate, Location, Double)
The LocalDate field is the Java object java.time.LocalDate.
The Location field is a custom type made of two Doubles (GPS coordinates) with this signature:
case class Location(lat: Double, lon: Double)
Below, one sample row:
(1975-01-01, Location(70.933,-8.667), -4.888888888888889)
Some details about my application / environment:
Scala: 2.11.8
Spark core: 2.1.1
Spark SQL: 2.1.1
Linux Ubuntu: 16.04 LTS
I have read in this article, How to store custom objects in Dataset?, that I need to define a custom Encoder, but I have no idea how :(.
The problem is that Spark does not find an encoder for regular classes. As of today, Spark only provides encoders for primitive and a handful of common types, and there is no good support for custom classes.
In your case, given that your "custom" class represents a date, you can use java.sql.Date instead of java.time.LocalDate. The benefit is that you can take advantage of the encoders already provided by Spark.
import java.sql.Date

case class TempRow(date: Date, loc: Location, temp: Double)

val ds = Seq(
  TempRow(java.sql.Date.valueOf("2017-06-01"), Location(1.4, 5.1), 4.9),
  TempRow(java.sql.Date.valueOf("2014-04-05"), Location(1.5, 2.5), 5.5)
).toDS

ds.show()
+----------+---------+----+
| date| loc|temp|
+----------+---------+----+
|2017-06-01|[1.4,5.1]| 4.9|
|2014-04-05|[1.5,2.5]| 5.5|
+----------+---------+----+
Check the schema:
ds.printSchema()
root
|-- date: date (nullable = true)
|-- loc: struct (nullable = true)
| |-- i: double (nullable = false)
| |-- j: double (nullable = false)
|-- temp: double (nullable = false)
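Since the question starts from an RDD of (LocalDate, Location, Double) tuples, a minimal sketch of the conversion step (assuming rdd holds those tuples as in the question's TemperatureRecord type, and import spark.implicits._ is in scope):
// java.sql.Date.valueOf(LocalDate) does the LocalDate -> Date conversion
val ds2 = rdd
  .map { case (d, loc, temp) => TempRow(java.sql.Date.valueOf(d), loc, temp) }
  .toDS()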
For more general cases, there is one trick that you can perform to store the majority of custom classes in a Spark dataset. Bear in mind that it does not work for all cases because you need to use a string as an intermediate representation of your custom object. I hope this issue will be solved in the future because it is really a pain.
Find below one solution for your case:
case class Location(val i: Double, val j: Double)
class TempRecord(val date: java.time.LocalDate, val loc: Location, val temp: Double)
type TempSerialized = (String, Location, Double)
implicit def fromSerialized(t: TempSerialized): TempRecord = new TempRecord(java.time.LocalDate.parse(t._1), t._2, t._3)
implicit def toSerialized(t: TempRecord): TempSerialized = (t.date.toString, t.loc, t.temp)
// Finally we can create datasets
val d = spark.createDataset(Seq[TempSerialized](
new TempRecord(java.time.LocalDate.now, Location(1.0,2.0), 3.0),
new TempRecord(java.time.LocalDate.now, Location(5.0,4.0), 4.0) )
).toDF("date", "location", "temperature").as[TempSerialized]
d.show()
+----------+---------+-----------+
| date| location|temperature|
+----------+---------+-----------+
|2017-07-11|[1.0,2.0]| 3.0|
|2017-07-11|[5.0,4.0]| 4.0|
+----------+---------+-----------+
d.printSchema()
root
|-- date: string (nullable = true)
|-- location: struct (nullable = true)
| |-- i: double (nullable = false)
| |-- j: double (nullable = false)
|-- temperature: double (nullable = false)
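When the rich type is needed again, the implicit conversion above can be applied inside a transformation, as long as the result still has an encoder; a small sketch:
// The TempSerialized => TempRecord implicit turns each tuple back into the rich type;
// mapping to a Double keeps the result encodable.
val temps = d.map(t => (t: TempRecord).temp)
temps.show()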

Spark DataFrame not respecting schema and considering everything as String

I am facing a problem which I have failed to get past for ages now.
I am on Spark 1.4 and Scala 2.10. I cannot upgrade at this moment (big distributed infrastructure).
I have a file with a few hundred columns, only 2 of which are strings and all the rest are Long. I want to convert this data into a Label/Features dataframe.
I have been able to get it into LibSVM format.
I just cannot get it into a Label/Features format.
The reason is that I am not able to use toDF() as shown here
https://spark.apache.org/docs/1.5.1/ml-ensembles.html
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt").toDF()
because it is not supported in 1.4.
So I first converted the text file into a DataFrame, using something like this:
def getSFColumnDType(columnName: String): StructField = {
  if ((columnName == "strcol1") || (columnName == "strcol2"))
    return StructField(columnName, StringType, false)
  else
    return StructField(columnName, LongType, false)
}

def getDataFrameFromTxtFile(sc: SparkContext, staticfeatures_filepath: String, schemaConf: String): DataFrame = {
  val sfRDD = sc.textFile(staticfeatures_filepath)
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)

  // reads a space delimited string from application.properties file
  val schemaString = readConf(Array(schemaConf)).get(schemaConf).getOrElse("")

  // Generate the schema based on the string of schema
  val schema =
    StructType(
      schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))

  val data = sfRDD
    .map(line => line.split(","))
    .map(p => Row.fromSeq(p.toSeq))

  var df = sqlContext.createDataFrame(data, schema)
  //schemaString.split(" ").drop(4)
  //.map(s => df = convertColumn(df, s, "int"))
  return df
}
When I do df.na.drop() and df.printSchema() on the returned dataframe, I get a perfect schema like this:
root
|-- rand_entry: long (nullable = false)
|-- strcol1: string (nullable = false)
|-- label: long (nullable = false)
|-- strcol2: string (nullable = false)
|-- f1: long (nullable = false)
|-- f2: long (nullable = false)
|-- f3: long (nullable = false)
and so on till around f300
But the sad part is that whatever I try to do with the df (see below), I always get a ClassCastException (java.lang.String cannot be cast to java.lang.Long):
val featureColumns = Array("f1","f2",....."f300")
assertEquals(-99,df.select("f1").head().getLong(0))
assertEquals(-99,df.first().get(4))
val transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(df)
So even though the schema says Long, the df is still internally treating everything as String.
Edit
Adding a simple example
Say I have a file like this
1,A,20,P,-99,1,0,0,8,1,1,1,1,131153
1,B,23,P,-99,0,1,0,7,1,1,0,1,65543
1,C,24,P,-99,0,1,0,9,1,1,1,1,262149
1,D,7,P,-99,0,0,0,8,1,1,1,1,458759
and
sf-schema=f0 strCol1 f1 strCol2 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11
(the column names really do not matter, so you can disregard this detail)
All I am trying to do is create a Label/Features kind of dataframe where my 3rd column becomes a label and the 5th to 11th columns become a feature [Vector] column, so that I can follow the rest of the steps in https://spark.apache.org/docs/latest/ml-classification-regression.html#tree-ensembles.
I have also cast the columns as suggested by zero323:
val types = Map("strCol1" -> "string", "strCol2" -> "string")
.withDefault(_ => "bigint")
df = df.select(df.columns.map(c => df.col(c).cast(types(c)).alias(c)): _*)
df = df.drop("f0")
df = df.drop("strCol1")
df = df.drop("strCol2")
But the assert and the VectorAssembler still fail.
featureColumns = Array("f2","f3",....."f11")
This is the whole sequence I want to run after I have my df:
var transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(df)
transformeddf.show(2)
transformeddf = new StringIndexer()
.setInputCol("f1")
.setOutputCol("indexedF1")
.fit(transformeddf)
.transform(transformeddf)
transformeddf.show(2)
transformeddf = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(5)
.fit(transformeddf)
.transform(transformeddf)
The exception trace from ScalaIDE, just when it hits the VectorAssembler, is below:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Long
at scala.runtime.BoxesRunTime.unboxToLong(BoxesRunTime.java:110)
at scala.math.Numeric$LongIsIntegral$.toDouble(Numeric.scala:117)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$5.apply(Cast.scala:364)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:436)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
at org.apache.spark.sql.catalyst.expressions.CreateStruct$$anonfun$eval$2.apply(complexTypes.scala:75)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:75)
at org.apache.spark.sql.catalyst.expressions.CreateStruct.eval(complexTypes.scala:56)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:72)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$3.apply(SparkPlan.scala:143)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
You get a ClassCastException because this is exactly what should happen. The schema argument is not used for automatic casting (some DataSources may use a schema this way, but not methods like createDataFrame). It only declares the types of the values that are stored in the rows. It is your responsibility to pass data which matches the schema, not the other way around.
While the DataFrame shows the schema you've declared, it is validated only at runtime, hence the runtime exception. If you want to transform the data to specific types, you have to cast the data explicitly.
First read all columns as StringType:
val rows = sc.textFile(staticfeatures_filepath)
.map(line => Row.fromSeq(line.split(",")))
val schema = StructType(
schemaString.split(" ").map(
columnName => StructField(columnName, StringType, false)
)
)
val df = sqlContext.createDataFrame(rows, schema)
Next cast selected columns to desired type:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StringType}

val types = Map("strcol1" -> StringType, "strcol2" -> StringType)
  .withDefault(_ => LongType)

val casted = df.select(df.columns.map(c => col(c).cast(types(c)).alias(c)): _*)
Use assembler:
val transformeddf = new VectorAssembler()
.setInputCols(featureColumns)
.setOutputCol("features")
.transform(casted)
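After the explicit cast, the typed getters from the question behave as expected; a quick sanity check (a sketch, the actual values depend on your data):
casted.select("f1").head().getLong(0)  // now a real Long, no ClassCastException
casted.printSchema()                   // feature columns are long, strcol1/strcol2 are string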
You can simplify steps 1 and 2 by using spark-csv:
// As originally
val schema = StructType(
schemaString.split(" ").map(fieldName => getSFColumnDType(fieldName)))
val df = sqlContext
.read.schema(schema)
.format("com.databricks.spark.csv")
.option("header", "false")
.load(staticfeatures_filepath)
See also Correctly reading the types from file in PySpark