Applying DataFrame operations to a single Row in mapWithState - Scala

I'm on spark 2.1.0 with Scala 2.11. I have a requirement to store state in Map[String, Any] format for every key. The right candidate to solve my problem appears to be mapWithState() which is defined in PairDStreamFunctions. The DStream on which I am applying mapWithState() is of type DStream[Row]. Before applying mapWithState(), I do this:
dstream.map(row => (row.get(0), row))
Now my DStream is of type DStream[(Any, Row)]. On this DStream I apply mapWithState, and here's how my update function looks:
def stateUpdateFunction(): (Any, Option[Row], State[Map[String, Any]]) => Option[Row] = {
  (key, newData, stateData) => {
    if (stateData.exists()) {
      val oldState = stateData.get()
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq(oldState.get("count").get, oldState.get("sum").get)))
    } else {
      stateData.update(Map("count" -> newData.get.get(1), "sum" -> newData.get.get(2)))
      Some(Row.fromSeq(newData.get.toSeq ++ Seq[Any](null, null)))
    }
  }
}
Right now, the update function only stores two values per key in the Map, appends the old values stored against "count" and "sum" to the input Row, and returns it. The state Map then gets updated with the values newly passed in the input Row. My requirement is to be able to do complex operations on the input Row, like we do on a DataFrame, before storing them in the state Map. In other words, I would like to be able to do something like this:
var transformedRow = originalRow.select(concat(upper($"C0"), lit("dummy")), lower($"C1") ...)
In the update function I don't have access to the SparkContext or SparkSession, so I cannot create a single-row DataFrame. If I could do that, applying DataFrame operations would not be difficult. I already have all the column expressions defined for the transformed row.
Here's my sequence of operations:
read state -> perform complex DataFrame operations using this state on the input row -> perform more complex DataFrame operations to define the new values for the state.
Is it possible to fetch the SparkPlan/logicalPlan corresponding to a DataFrame query/operation and apply it to a single Spark SQL Row? I would very much appreciate any leads here. Please let me know if the question is unclear or if more details are required.

I've found a not-so-efficient solution to the given problem. Using the DataFrame operations we already know, we can create an empty DataFrame with the already-known schema. This DataFrame gives us the SparkPlan through
DataFrame.queryExecution.sparkPlan
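For illustration, a minimal sketch of obtaining that plan on the driver (the schema below is hypothetical; the column expressions are the ones from the question):
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{concat, lit, lower, upper}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// Empty DataFrame with the already-known schema, used only to plan the query
val schema = StructType(Seq(StructField("C0", StringType), StructField("C1", StringType)))
val emptyDF = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

// Apply the known column expressions, then grab the physical plan
val planned = emptyDF.select(concat(upper(emptyDF("C0")), lit("dummy")), lower(emptyDF("C1")))
val plan = planned.queryExecution.sparkPlan // to be shipped to stateUpdateFunction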
This object is serializable and can be passed to stateUpdateFunction. Inside stateUpdateFunction, we can iterate over the expressions contained in the passed SparkPlan, transforming each one to replace attribute references with the corresponding literals:
import org.apache.spark.sql.catalyst.expressions.{AttributeReference, Literal}

sparkPlan.expressions.map { expr =>
  expr.transform {
    case attr: AttributeReference =>
      println(s"Resolving ${attr.name} to ${map.getOrElse(attr.name, null)}, " +
        s"type: ${map.getOrElse(attr.name, null).getClass.toString}")
      Literal(map.getOrElse(attr.name, null))
    case a => a
  }
}
The map here refers to the Row's column-value pairs. On these transformed expressions we call eval, passing it an empty InternalRow; this gives us a result for every expression. Because this method relies on interpreted evaluation and doesn't employ code generation, it will be inefficient in a real-world use case, but I'll dig further to find out how code generation can be leveraged here.
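For completeness, the evaluation step then looks roughly like this (resolvedExpressions stands for the result of the map above):
import org.apache.spark.sql.catalyst.InternalRow

// Every attribute has already been replaced by a Literal, so an empty row suffices as input
val values = resolvedExpressions.map(_.eval(InternalRow.empty))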

Related

MongoTypeConversionException: Cannot cast STRING into a NullType with Mongo Spark Connector even when explicit schema does not contain NullTypes

I'm importing a collection from MongoDB to Spark.
val partitionDF = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "db")
  .option("collection", collectionName)
  .load()
For the data column in the resulting DataFrame, I get this type:
StructType(StructField(configurationName,NullType,true), ...
so at least some types in some columns are NullType.
As per "Writing null values to Parquet in Spark when the NullType is inside a StructType", I try to fix the schema by replacing all NullTypes with StringTypes:
def denullifyStruct(struct: StructType): StructType = {
  val items = struct.map { field =>
    StructField(field.name, denullify(field.dataType), field.nullable, field.metadata)
  }
  StructType(items)
}

def denullify(dt: DataType): DataType = dt match {
  case struct: StructType => denullifyStruct(struct)
  case array: ArrayType   => ArrayType(denullify(array.elementType), array.containsNull)
  case _: NullType        => StringType
  case other              => other
}
val fixedDF = spark.createDataFrame(partitionDF.rdd, denullifyStruct(partitionDF.schema))
Running fixedDF.printSchema, I can see that no NullType exists in fixedDF's schema anymore. But when I try to save it to Parquet
fixedDF.write.mode("overwrite").parquet(partitionName + ".parquet")
I get the following error:
Caused by: com.mongodb.spark.exceptions.MongoTypeConversionException: Cannot cast STRING into a NullType (value: BsonString{value='117679.8'})
at com.mongodb.spark.sql.MapFunctions$.convertToDataType(MapFunctions.scala:214)
at com.mongodb.spark.sql.MapFunctions$.$anonfun$documentToRow$1(MapFunctions.scala:37)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
A NullType again!
The same issue occurs when I just count the number of rows: fixedDF.count().
Does Spark infer the schema again when writing to Parquet (or counting)? Is it possible to turn such inference off (or overcome this in some other way)?
The issue is not due to the Parquet write method. The error occurs while reading the data as a DataFrame, due to a type-cast problem. This JIRA page says we need to add the samplePoolSize option along with the other options while reading data from MongoDB.
The problem is that, even if you supply a DataFrame with an explicit schema, for some operations (like count() or for saving to disk) a Mongo-derived DataFrame will still infer the schema.
To infer the schema, it uses sampling, which means it does not see all the data while inferring. If it only ever sees a field holding a null value, it will infer NullType for that field, and later, when it encounters that field holding a string, the string cannot be converted to NullType.
So the fundamental problem here is sampling. If your schema is stable and 'dense' (every or near every document has all fields filled), sampling will work well. But if some fields are 'sparse' (null with high probability), sampling could fail.
A crude solution is to avoid sampling altogether, that is, to infer the schema from the general population and not from a sample. If there is not too much data (or you can afford to wait), it could work.
Here is an experimental branch: https://github.com/rpuch/mongo-spark/tree/read-full-collection-instead-of-sampling
The idea is to switch from sampling to using the whole collection if configured so. It is a bit too cumbersome to introduce a new configuration option, so I just disable sampling if 'sampleSize' configuration option is set to 1, like this:
.option("sampleSize", 1) // MAGIC! This effectively turns sampling off, instead the schema is inferred based on general population
In such a case, sampling is avoided altogether. The obvious alternative of sampling with N equal to the collection size makes MongoDB sort a lot of data in memory, which seems problematic, hence I disable sampling completely.
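For reference, the full read then looks like this (it only has the described effect on the experimental branch above; on the stock connector sampleSize is just the number of sampled documents):
val df = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .option("database", "db")
  .option("collection", collectionName)
  .option("sampleSize", 1) // patched connector: infer the schema from the whole collection
  .load()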

Scala: java.lang.UnsupportedOperationException: Primitive types are not supported

I've added the following code:
var counters: Map[String, Int] = Map()
val results = rdd
  .filter(l => l.contains("xyz"))
  .map(l => mapEvent(l))
  .filter(r => r.isDefined)
  .map { i =>
    val date = i.get.getDateTime.toString.substring(0, 10)
    counters = counters.updated(date, counters.getOrElse(date, 0) + 1)
  }
I want to get counts for the different dates in the RDD in one single iteration. But when I run this I get a message saying:
No implicits found for parameters evidence$6: Encoder[Unit]
So I added this line:
implicit val myEncoder: Encoder[Unit] = org.apache.spark.sql.Encoders.kryo[Unit]
But then I get this error.
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.xyz.SparkBatchJob.main(SparkBatchJob.scala)
Caused by: java.lang.UnsupportedOperationException: Primitive types are not supported.
at org.apache.spark.sql.Encoders$.genericSerializer(Encoders.scala:200)
at org.apache.spark.sql.Encoders$.kryo(Encoders.scala:152)
How do I fix this? Or is there a better way to get the counts I want in a single iteration (O(N) time)?
A Spark RDD is a representation of a distributed collection. When you apply a map function to an RDD, the function you use to manipulate the collection is executed across the cluster, so it makes no sense to mutate a variable created outside the scope of the map function.
In your code, the problem is that you don't return any value; instead you are mutating a structure, and for that reason the compiler infers that the RDD created by the transformation is an RDD[Unit].
If you need to create a Map as the result of a Spark action, you must create a pair RDD and then apply a reduce operation, as sketched below.
If you include the type of the rdd and the mapEvent function, I can show in more detail how it could be done.
Spark builds a DAG from the transformations and the action; it does not process the data twice.
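A minimal sketch, assuming rdd holds the log lines as strings and mapEvent returns an Option of some event type exposing getDateTime (both names taken from the question; the exact types are assumptions):
// One distributed pass: build (date, 1) pairs and reduce by key
val countsByDate: Map[String, Int] = rdd
  .filter(l => l.contains("xyz"))
  .flatMap(l => mapEvent(l).toSeq) // keeps only the defined events
  .map(e => (e.getDateTime.toString.substring(0, 10), 1))
  .reduceByKey(_ + _)
  .collectAsMap()
  .toMap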

Spark scala data frame udf returning rows

Say I have a dataframe which contains a column (called colA) which is a seq of rows. I want to append a new field to each record of colA. (The new field is derived from the existing record, so I have to write a UDF.)
How should I write this UDF?
I have tried to write a UDF which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is that the UDF cannot return Seq[Row]; the exception is 'Schema for type org.apache.spark.sql.Row is not supported'.
What should I do?
The UDF that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
Since Spark 2.0, you can create UDFs which return Row / Seq[Row], but you must provide the schema for the return type, e.g. if you work with an Array of Doubles:
val schema = ArrayType(DoubleType)

val myUDF = udf((s: Seq[Row]) => {
  s // just pass data without modification
}, schema)
But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDFs.
EDIT: It could be useful if your row contains more than 22 fields (the limit of fields for tuples/case classes).
This is an old question, but I just wanted to update it for newer versions of Spark.
Since Spark 3.0.0, the method that @Raphael Roth mentioned is deprecated, so you might get an AnalysisException. The reason is that the input closure used with this method doesn't have type checking, and the behavior might differ from what we expect in SQL when it comes to null values.
If you really know what you're doing, you need to set the spark.sql.legacy.allowUntypedScalaUDF configuration to true.
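For example, a minimal sketch of enabling it when building the session (only do this if the null-handling caveats are acceptable for you):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.legacy.allowUntypedScalaUDF", "true") // re-enables the schema-based udf variant
  .getOrCreate()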
Another solution is to use a case class instead of a schema. For example:
case class Foo(field1: String, field2: String)

val convertFunction: Seq[Row] => Seq[Foo] = input => {
  input.map { x =>
    // do something with x and convert it to a Foo, e.g. (placeholder conversion):
    Foo(x.getAs[String](0), x.getAs[String](1))
  }
}
val myUdf = udf(convertFunction)
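Applying it then looks roughly like this (df and the colA column name are taken from the question; the exact frame is an assumption):
import org.apache.spark.sql.functions.col

// Replace the array of structs in colA with an array of Foo structs produced by the UDF
val result = df.withColumn("colA", myUdf(col("colA")))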

Dataset.groupByKey + untyped aggregation functions

Suppose I have types like these:
case class SomeType(id: String, x: Int, y: Int, payload: String)
case class Key(x: Int, y: Int)
Then suppose I did groupByKey on a Dataset[SomeType] like this:
val input: Dataset[SomeType] = ...
val grouped: KeyValueGroupedDataset[Key, SomeType] =
  input.groupByKey(s => Key(s.x, s.y))
Then suppose I have a function which determines which field I want to use in an aggregation:
val chooseDistinguisher: SomeType => String = _.id
And now I would like to run an aggregation function over the grouped dataset, for example, functions.countDistinct, using the field obtained by the function:
grouped.agg(
  countDistinct(<something which depends on chooseDistinguisher>).as[Long]
)
The problem is, I cannot create a UDF from chooseDistinguisher, because countDistinct accepts a Column, and to turn a UDF into a Column you need to specify the input column names, which I cannot do - I do not know which name to use for the "values" of a KeyValueGroupedDataset.
I think it should be possible, because KeyValueGroupedDataset itself does something similar:
def count(): Dataset[(K, Long)] = agg(functions.count("*").as(ExpressionEncoder[Long]()))
However, this method cheats a bit because it uses "*" as the column name, but I need to specify a particular column (i.e. the column of the "value" in a key-value grouped dataset). Also, when you use typed functions from the typed object, you also do not need to specify the column name, and it works somehow.
So, is it possible to do this, and if it is, how to do it?
As far as I know, it's not possible with the agg transformation, which expects a TypedColumn that is constructed from a Column using the as method, so you need to start from a non-type-safe expression. If somebody knows a solution, I'd be interested to see it...
If you need type-safe aggregation, you can use one of the approaches below:
mapGroups - where you can implement a Scala function responsible for aggregating the Iterator
implement your own custom Aggregator as suggested above
The first approach needs less code, so below I'm showing a quick example:
def countDistinct[T](values: Iterator[T])(chooseDistinguisher: T => String): Long =
  values.map(chooseDistinguisher).toSeq.distinct.size

ds
  .groupByKey(s => Key(s.x, s.y))
  .mapGroups((k, vs) => (k, countDistinct(vs)(_.id)))
In my opinion, the type-safe Spark Dataset API is still much less mature than the non-type-safe DataFrame API. Some time ago I was thinking that it could be a good idea to implement a simple-to-use type-safe aggregation API for Spark Datasets.
Currently, this use case is better handled with DataFrame, which you can later convert back into a Dataset[A].
// Code assumes SQLContext implicits are present
import org.apache.spark.sql.{functions => f}

val colName = "id"

ds.toDF
  .withColumn("key", f.concat('x, f.lit(":"), 'y))
  .groupBy('key)
  .agg(f.countDistinct(f.col(colName)).as("cntd"))
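To get back to a typed Dataset afterwards, as mentioned above, the untyped result can be converted with as; a sketch, assuming the aggregation above is bound to a val named aggregated and the implicits are in scope:
// Hypothetical target type matching the "key" and "cntd" columns
case class KeyCount(key: String, cntd: Long)
val typed = aggregated.as[KeyCount]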

How to create udf containing Array (case class) for complex column in a dataframe

I have a dataframe which has a complex column of type ArrayType of StructType. To transform this dataframe I have created a UDF which consumes this column as an Array[case class] parameter. The main bottleneck is that when I create the case class according to the StructType, the StructField names contain special characters, for example "##field". So I give the case class field the same name (a case class with a `##field` member) and attach this to the UDF parameter. After interpretation, the Spark UDF definition changes the name of the case class field to "$hash$hashfield". When performing the transform on this dataframe, it fails because of this mismatch. Please help...
Due to JVM limitations, Scala stores such identifiers in encoded form, and currently Spark can't map ##field to $hash$hashfield.
One possible solution is to extract the fields manually from the raw Row (you need to know the order of the fields in the df; you can use df.schema for that):
// Foo is a case class wrapping the extracted field(s)
// Option 1: pattern match on the Row
val myUdf = udf { (struct: Row) =>
  struct match {
    case Row(a: String) => Foo(a)
  }
}

// ... or Option 2: extract the values from the Row by position
val myUdf2 = udf { (struct: Row) =>
  val `##a` = struct.getAs[String](0)
  Foo(`##a`)
}