I am trying to write the following PySpark dataframe to CSV using
df_final.write.csv("v1_results/run1", header=True, emptyValue='', nullValue='')
Please help me figure out why this issue occurs.
Below is the schema associated with the dataframe
root
|-- a: string (nullable = true)
|-- id: string (nullable = true)
|-- b_name: string (nullable = true)
|-- sim: float (nullable = true)
Below is the stack trace of the error:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 16 in stage 17.0 failed 4 times, most recent failure: Lost task 16.3 in stage 17.0 (TID 1074) (cluster-a183-w-1.us-central1-c.c.analytics-online-data-sci-thd.internal executor 8): java.lang.RuntimeException: Error while decoding: java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Double"
- root class: "scala.collection.Seq"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
mapobjects(lambdavariable(MapObject, DoubleType, true, -1), assertnotnull(lambdavariable(MapObject, DoubleType, true, -1)), input[0, array<double>, true], Some(interface scala.collection.Seq))
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$Deserializer.apply(ExpressionEncoder.scala:186)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$scalaConverter$2(ScalaUDF.scala:159)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:307)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$1.writeIteratorToStream(PythonUDFRunner.scala:53)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:397)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1996)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:232)
Caused by: java.lang.NullPointerException: Null value appeared in non-nullable field:
- array element class: "scala.Double"
- root class: "scala.collection.Seq"
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).
This question is a continuation of this other one, where the user
who gave the valid answer asked me to open a new question to explain my further doubts.
What I am trying to do is generate a dataframe from an RDD[Objects], where my objects have primitive types but also complex types. In the previous question it was explained how to parse a complex Map type.
What I tried next is to extrapolate the given solution to parse a Map[Map], so in the DataFrame it is converted into an Array(Map).
Below I give the code I have written so far:
//I get an Object from Hbase here
val objectRDD : RDD[HbaseRecord] = ...

//I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD : RDD[Row] = objectRDD.map(
    hbaseRecord => {
        val uuid : String = hbaseRecord.uuid
        val timestamp : String = hbaseRecord.timestamp

        val name = Row(hbaseRecord.nameMap.firstName.getOrElse(""),
            hbaseRecord.nameMap.middleName.getOrElse(""),
            hbaseRecord.nameMap.lastName.getOrElse(""))

        val contactsMap = hbaseRecord.contactsMap

        val homeContactMap = contactsMap.get("HOME")
        val homeContact = Row(homeContactMap.contactType,
            homeContactMap.areaCode,
            homeContactMap.number)

        val workContactMap = contactsMap.get("WORK")
        val workContact = Row(workContactMap.contactType,
            workContactMap.areaCode,
            workContactMap.number)

        val contacts = Row(homeContact, workContact)

        Row(uuid, timestamp, name, contacts)
    }
)
//Here I define the schema
val schema = new StructType()
    .add("uuid", StringType)
    .add("timestamp", StringType)
    .add("name", new StructType()
        .add("firstName", StringType)
        .add("middleName", StringType)
        .add("lastName", StringType))
    .add("contacts", new StructType(
        Array(
            StructField("contactType", StringType),
            StructField("areaCode", StringType),
            StructField("number", StringType)
        )))
//Now I try to create a Dataframe using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD , schema)
But I am getting the following error:
19/03/18 12:09:53 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 8)
scala.MatchError: [HOME,05,12345678] (of class org.apache.spark.sql.catalyst.expressions.GenericRow)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I tried as well to generate the contacts element as an array:
val contacts = Array(homeContact,workContact)
But then I get the following error instead:
scala.MatchError: [Lorg.apache.spark.sql.Row;@726c6aec (of class [Lorg.apache.spark.sql.Row;)
Can anyone spot the problem?
Let's simplify your situation to your array of contacts. That's where the problem is. You are trying to use this schema:
val schema = new StructType()
    .add("contacts", new StructType(
        Array(
            StructField("contactType", StringType),
            StructField("areaCode", StringType),
            StructField("number", StringType)
        )))
to store a list of contacts. Yet this schema is itself a single struct type: it cannot contain a list, just one contact. We can verify it with:
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema).printSchema
root
|-- contacts: struct (nullable = true)
| |-- contactType: string (nullable = true)
| |-- areaCode: string (nullable = true)
| |-- number: string (nullable = true)
Indeed, the Array you have in your code is just meant to contain the fields of your "contacts" struct type.
To achieve what you want, a type exists: ArrayType. This yields a slightly different result:
val schema_ok = new StructType()
    .add("contacts", ArrayType(new StructType(Array(
        StructField("contactType", StringType),
        StructField("areaCode", StringType),
        StructField("number", StringType)))))
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema_ok).printSchema
root
|-- contacts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- contactType: string (nullable = true)
| | |-- areaCode: string (nullable = true)
| | |-- number: string (nullable = true)
and it works:
val row = Row(Array(
Row("type", "code", "number"),
Row("type2", "code2", "number2")))
spark.createDataFrame(sc.parallelize(Seq(row)), schema_ok).show(false)
+-------------------------------------------+
|contacts |
+-------------------------------------------+
|[[type,code,number], [type2,code2,number2]]|
+-------------------------------------------+
So if you update the schema with this version of "contacts", just replace val contacts = Row(homeContact, workContact) with val contacts = Array(homeContact, workContact) and it should work.
NB: if you want to label your contacts (with HOME or WORK), there is a MapType type as well; a small sketch follows.
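Here is a minimal, hedged sketch of what a Map-based variant could look like; the schema_map / row_map names and the sample values are made up for this example, not taken from the original code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val contactStruct = new StructType(Array(
    StructField("contactType", StringType),
    StructField("areaCode", StringType),
    StructField("number", StringType)))

// The map key ("HOME" / "WORK") becomes the label of each contact.
val schema_map = new StructType()
    .add("contacts", MapType(StringType, contactStruct))

val row_map = Row(Map(
    "HOME" -> Row("type", "code", "number"),
    "WORK" -> Row("type2", "code2", "number2")))

spark.createDataFrame(sc.parallelize(Seq(row_map)), schema_map).printSchema
root
 |-- contacts: map (nullable = true)
 |    |-- key: string
 |    |-- value: struct (valueContainsNull = true)
 |    |    |-- contactType: string (nullable = true)
 |    |    |-- areaCode: string (nullable = true)
 |    |    |-- number: string (nullable = true)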
I have a dataframe with the below schema and a sample record:
root
|-- name: string (nullable = true)
|-- matches: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = false)
+---------------+------------------------------------------------------------------------------------------+
|name |matches |
+---------------+------------------------------------------------------------------------------------------+
|CVS_Extra |Map(MLauer -> 1, MichaelBColeman -> 1, OhioFoodbanks -> 1, 700wlw -> 1, cityofdayton -> 1)|
I am trying to convert the map type column to JSON using the below code (json4s library):
val d = countDF.map( row => (row(0),convertMapToJSON(row(1).asInstanceOf[Map[String, Int]]).toString()))
But it fails with
java.lang.ClassNotFoundException: scala.Any
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:338)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at scala.reflect.runtime.JavaMirrors$JavaMirror.javaClass(JavaMirrors.scala:555)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1210)
at scala.reflect.runtime.JavaMirrors$JavaMirror$$anonfun$classToJava$1.apply(JavaMirrors.scala:1202)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache$$anonfun$toJava$1.apply(TwoWayCaches.scala:50)
at scala.reflect.runtime.Gil$class.gilSynchronized(Gil.scala:19)
at scala.reflect.runtime.JavaUniverse.gilSynchronized(JavaUniverse.scala:16)
at scala.reflect.runtime.TwoWayCaches$TwoWayCache.toJava(TwoWayCaches.scala:45)
at scala.reflect.runtime.JavaMirrors$JavaMirror.classToJava(JavaMirrors.scala:1202)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:194)
at scala.reflect.runtime.JavaMirrors$JavaMirror.runtimeClass(JavaMirrors.scala:54)
at org.apache.spark.sql.catalyst.ScalaReflection$.getClassFromType(ScalaReflection.scala:682)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$dataTypeFor(ScalaReflection.scala:84)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:614)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
Scala version 2.11, json4s-jackson_2.11 and Spark 2.2.0.
Can anyone please suggest how to overcome this error? Thanks in advance.
Your code fails because you use the apply method incorrectly. You should use, for example:
countDF.map(row =>
  (row.getString(0), convertMapToJSON(row.getMap[String, Int](1)).toString())
)
For more see Spark extracting values from a Row.
But all you need is select / withColumn with to_json:
import org.apache.spark.sql.functions.to_json
countDF.withColumn("matches", to_json($"matches"))
and if your function uses more complex logic, use a udf:
import org.apache.spark.sql.functions.udf
val convert_map_to_json = udf(
(map: Map[String, Int]) => convertMapToJSON(map).toString
)
countDF.withColumn("matches", convert_map_to_json($"matches"))
When I want to rename columns of my DataFrame in Spark 2.2 and print its content using show(), I get the following errors:
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'project' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'client' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'twitter_mentioned_user' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'author' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 WARN ScalaRowValueReader: Field 'cluster' is backed by an array but the associated Spark Schema does not reflect this;
(use es.read.field.as.array.include/exclude)
18/01/04 12:05:37 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 7)
scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:276)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:275)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:379)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:61)
at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:58)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
Caused by: scala.MatchError: Buffer(13145439) (of class scala.collection.convert.Wrappers$JListWrapper)
I printed the schema and it looks as follows:
df_processed
.withColumn("srcId", toInt(df_processed("srcId")))
.withColumn("dstId", toInt(df_processed("dstId")))
.withColumn("attr", rand).printSchema()
Output:
root
|-- srcId: integer (nullable = true)
|-- dstId: integer (nullable = true)
|-- attr: double (nullable = false)
The error occurs when I run this code:
df_processed
.withColumn("srcId", toInt(df_processed("srcId")))
.withColumn("dstId", toInt(df_processed("dstId")))
.withColumn("attr", rand).show()
It occurs when I add .withColumn("attr", rand), but it works when I use .withColumn("attr2", lit(0)).
UPDATE:
df_processed.printSchema()
root
|-- srcId: double (nullable = true)
|-- dstId: double (nullable = true)
df_processed.show() does not give any error.
Here is a similar example to what you are trying to do. To cast the data type you can use the cast function:
import org.apache.spark.sql.functions.rand
import org.apache.spark.sql.types.IntegerType

val ds = Seq(
  (1.2, 3.5),
  (1.2, 3.5),
  (1.2, 3.5)
).toDF("srcId", "dstId")

ds.withColumn("srcId", $"srcId".cast(IntegerType))
  .withColumn("dstId", $"dstId".cast(IntegerType))
  .withColumn("attr", rand)
Hope this helps!
You can add a UDF function:
import org.apache.spark.sql.functions.udf

val getRandom = udf(() => scala.math.random)
val df2 = df.withColumn("randomColumn", getRandom())
I'm trying to retrieve a list from a row with the following schema element.
[info] |-- ARRAY_FIELD: array (nullable = false)
[info] | |-- element: string (containsNull = false)
When printing using
row.getAs[WrappedArray[String]]("ARRAY_FIELD")
I get the following result
WrappedArray(Some String value)
But when I attempt to print the data at that index as a list using
row.getList(0)
I get the following exception
java.lang.ClassCastException: java.math.BigDecimal cannot be cast to scala.collection.Seq
Does anyone have any ideas on why this happens and how it can be resolved?
I was actually pulling from the wrong index in the schema. I assumed that the index for getList was based on the order of the elements shown when using df.printSchema, but I was wrong: it was off by 6 positions.
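To avoid counting positions by hand, here is a small sketch (assuming row is the same Row as above) that resolves the ordinal from the field name instead:
import scala.collection.mutable.WrappedArray

// Look the ordinal up by name rather than eyeballing printSchema.
val idx = row.fieldIndex("ARRAY_FIELD")
val values = row.getAs[WrappedArray[String]](idx)

// Or skip the index entirely and fetch by name:
val values2 = row.getAs[Seq[String]]("ARRAY_FIELD")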
Given a dataframe in which one column is a sequence of structs, generated by the following code
import org.apache.spark.sql.functions.{collect_list, struct}

val df = spark
  .range(10)
  .map((i) => (i % 2, util.Random.nextInt(10), util.Random.nextInt(10)))
  .toDF("a", "b", "c")
  .groupBy("a")
  .agg(collect_list(struct($"b", $"c")).as("my_list"))

df.printSchema
df.show(false)
Outputs
root
|-- a: long (nullable = false)
|-- my_list: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- b: integer (nullable = false)
| | |-- c: integer (nullable = false)
+---+-----------------------------------+
|a |my_list |
+---+-----------------------------------+
|0 |[[0,3], [9,5], [3,1], [4,2], [3,3]]|
|1 |[[1,7], [4,6], [5,9], [6,4], [3,9]]|
+---+-----------------------------------+
I need to run a function over each struct list. The function prototype is similar to the function below
case class DataPoint(b: Int, c: Int)
def do_something_with_data(data: Seq[DataPoint]): Double = {
// This is an example. I don't actually want the sum
data.map(data_point => data_point.b + data_point.c).sum
}
I want to store the result of this function to another DataFrame column.
I tried to run
val my_udf = udf(do_something_with_data(_))
val df_with_result = df.withColumn("result", my_udf($"my_list"))
df_with_result.show(false)
and got
17/07/13 12:33:42 WARN TaskSetManager: Lost task 0.0 in stage 15.0 (TID 225, REDACTED, executor 0): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (array<struct<b:int,c:int>>) => double)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line27.$read$$iw$$iw$DataPoint
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$do_something_with_data$1.apply(<console>:29)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at $line28.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.do_something_with_data(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
at $line32.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:29)
Is it possible to use a UDF like this without first casting my rows to a container struct with the DataFrame API?
Doing something like:
case class MyRow(a: Long, my_list: Seq[DataPoint])

df.as[MyRow].map(r => (r.a, r.my_list, do_something_with_data(r.my_list)))
using the Dataset API works, but I'd prefer to stick with the DataFrame API if possible.
You cannot use a case class as the input argument of your UDF (but you can return case classes from the UDF). To map an array of structs, you can pass a Seq[Row] to your UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

val my_udf = udf((data: Seq[Row]) => {
  // This is an example. I don't actually want the sum
  data.map { case Row(x: Int, y: Int) => x + y }.sum
})

df.withColumn("result", my_udf($"my_list")).show
+---+--------------------+------+
| a| my_list|result|
+---+--------------------+------+
| 0|[[0,3], [5,5], [3...| 41|
| 1|[[0,9], [4,9], [6...| 54|
+---+--------------------+------+
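If you would rather keep do_something_with_data unchanged, one possible variant (a sketch, not the only way) is to rebuild the DataPoint instances from the Rows inside the UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Convert each struct Row back into a DataPoint, then reuse the original function.
val my_udf2 = udf((data: Seq[Row]) =>
  do_something_with_data(data.map(r => DataPoint(r.getAs[Int]("b"), r.getAs[Int]("c")))))

df.withColumn("result", my_udf2($"my_list")).show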