How to extract JSON from a binary protobuf? - scala

Considering an Apache Spark 2.2.0 Structured Stream such as:
jsonStream.printSchema()
root
|-- body: binary (nullable = true)
The data inside body is Protocol Buffers v2 wrapping a nested JSON string. The schema looks like:
syntax = "proto2";

message Data {
  required string data = 1;
}

message List {
  repeated Data entry = 1;
}
How can I extract the data inside Spark to "further" process it?
I looked into ScalaPB, but since I run my code in Jupyter I couldn't get the ".proto" code to be included inline. I also do not know how to convert a DataFrame to an RDD on a stream: trying .rdd failed because of a streaming source.
Update 1: I figured out how to generate Scala files from protobuf specifications using ScalaPB's console tool. Still, I'm not able to import them because of a "type mismatch".

tl;dr Write a user-defined function (UDF) to deserialize the binary field (protobuf carrying a JSON payload) to JSON.
Think of the serialized body (in binary format) as a table column. Forget about Structured Streaming for a moment (and streaming Datasets).
Let me then rephrase the question to the following:
How to convert (aka cast) a value in binary to [your format here]?
Some formats are directly cast-able, which makes converting binaries to strings as easy as:
$"body" cast "string"
If the string is then JSON or a Unix timestamp, you could use the built-in "converters", i.e. functions like from_json or from_unixtime.
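For example, if body were already a plain JSON string (not quite your case yet, since yours is protobuf-wrapped), a hedged sketch with a placeholder schema could look like this:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._   // assumes a SparkSession named spark, for the $ syntax

// Placeholder schema of the JSON payload; replace with your actual fields.
val schema = StructType(Seq(StructField("field", StringType)))
jsonStream.select(from_json($"body" cast "string", schema) as "parsed")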
The introduction should give you a hint how to do conversions like yours.
The data inside body is of type Protocol Buffers v2 and a nested JSON.
To deal with such fields (protobuf + json) you'd have to write a Scala function to decode the "payload" to JSON and create a user-defined function (UDF) using udf:
udf(f: UDF1[_, _], returnType: DataType): UserDefinedFunction

Defines a Java UDF1 instance as user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
Then use functions like from_json or get_json_object.
To make your case simpler, write a single-argument function that does the conversion and wrap it into a UDF using the udf function.
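For illustration, here is a minimal sketch. myproto.List stands for the ScalaPB-generated companion of the List message above (generated companions expose parseFrom(bytes)), and the JSON schema is a placeholder; adjust both to your generated package and actual payload:

import org.apache.spark.sql.functions.{col, from_json, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Decode the protobuf and pull out the JSON string carried in the `data` field.
// myproto.List is an assumption for the ScalaPB-generated class of the .proto above.
val decodeBody = udf { bytes: Array[Byte] =>
  val list = myproto.List.parseFrom(bytes)
  list.entry.headOption.map(_.data).orNull
}

// Placeholder shape of the nested JSON; replace with your real fields.
val jsonSchema = StructType(Seq(StructField("id", StringType), StructField("value", StringType)))

val decoded = jsonStream
  .withColumn("json", decodeBody(col("body")))
  .withColumn("parsed", from_json(col("json"), jsonSchema))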
Trying .rdd failed because of a streaming source.
Use Dataset.foreach or foreachPartition.
foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all rows.

foreachPartition(f: (Iterator[T]) ⇒ Unit): Unit
Applies a function f to each partition of this Dataset.

Related

Non-relational database in statically typed languages (rethinkdb, Scala)

I'm still pretty new to Scala and have arrived at some kind of typing roadblock.
Non-SQL databases such as Mongo and RethinkDB do not enforce any schema for their tables and manage data in JSON format. I've been struggling to get the Java API for RethinkDB to work from Scala, and there seems to be surprisingly little information on how to actually use the results returned from the database.
Assuming a simple document schema such as this:
{
  "name": "melvin",
  "age": 42,
  "tags": ["solution"]
}
I fail to see how to actually use this data in Scala. After running a query, for example something like r.table("test").run(connection), I receive an object over which I can iterate, yielding AnyRef objects. In the Python world this would most likely be a simple dict. How do I convey the structure of this data to Scala, so I can use it in code (e.g., query fields of the returned documents)?
From a quick scan of the docs and code, the Java Rethink client uses Jackson to handle deserialization of the JSON received from the DB into JVM objects. Since by definition every JSON object received is going to be deserializable into a JSON AST (Abstract Syntax Tree: a representation in plain Scala objects of the structure of a JSON document), you could implement a custom Jackson ObjectMapper which, instead of doing the usual Jackson magic with reflection, always deserializes into the JSON AST.
For example, Play JSON defers the actual serialization/deserialization to/from JSON to Jackson: it installs a module into a vanilla ObjectMapper which specially takes care of instances of JsValue, which is the root type of Play JSON's AST. Then something like this should work:
import com.fasterxml.jackson.databind.ObjectMapper
import com.rethinkdb.RethinkDB
import play.api.libs.json.JsonParserSettings
import play.api.libs.json.jackson.PlayJsonModule

// Use Play JSON's ObjectMapper... best to do this before connecting
RethinkDB.setResultMapper(new ObjectMapper().registerModule(new PlayJsonModule(JsonParserSettings())))
run(connection) returns a Result[AnyRef] in Scala notation. There's an alternative version, run(connection, typeRef), where the second argument specifies a result type; this is passed to the ObjectMapper to ensure that every document will either fail to deserialize or be an instance of that result type:
import play.api.libs.json.JsValue
val result = r.table("table").run(connection, classOf[JsValue]) : Result[JsValue]
You can then get the next element from the result as a JsValue and use the usual Play JSON machinery to convert the JsValue into your domain type:
import play.api.libs.json.{Json, OFormat}

case class MyDocument(name: String, age: Int, tags: Seq[String])

object MyDocument {
  implicit val jsonFormat: OFormat[MyDocument] = Json.format[MyDocument]
}

// result is a Result[JsValue] ... may need an import MyDocument.jsonFormat or similar
val myDoc = Json.fromJson[MyDocument](result.next()).asOpt : Option[MyDocument]
With some enrichments (e.g. implicit classes) you could improve the Scala API and make a lot of this machinery more transparent.
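For instance, a minimal sketch of such an enrichment, assuming the Result[JsValue] from above (com.rethinkdb.net.Result is the result type of the 2.4 Java driver; adjust the import to your driver version):

import com.rethinkdb.net.Result
import play.api.libs.json.{JsValue, Json, Reads}

// Hypothetical enrichment: read the next document as a domain type,
// returning None if it doesn't conform to the expected shape.
implicit class RichJsValueResult(val result: Result[JsValue]) extends AnyVal {
  def nextAs[A: Reads]: Option[A] = Json.fromJson[A](result.next()).asOpt
}

// usage: result.nextAs[MyDocument]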
You could do similar things with the other Scala JSON ASTs (e.g. Circe, json4s), but might have to implement functionality similar to what Play does with the ObjectMapper yourself.

What are Untyped Scala UDF and Typed Scala UDF? What are their differences?

I've been using Spark 2.4 for a while and just started switching to Spark 3.0 in the last few days. I got this error when running udf((x: Int) => x, IntegerType) after switching to Spark 3.0:
Caused by: org.apache.spark.sql.AnalysisException: You're using untyped Scala UDF, which does not have the input type information. Spark may blindly pass null to the Scala closure with primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. `udf((x: Int) => x, IntegerType)`, the result is 0 for null input. To get rid of this error, you could:
1. use typed Scala UDF APIs(without return type parameter), e.g. `udf((x: Int) => x)`
2. use Java UDF APIs, e.g. `udf(new UDF1[String, Integer] { override def call(s: String): Integer = s.length() }, IntegerType)`, if input types are all non primitive
3. set spark.sql.legacy.allowUntypedScalaUDF to true and use this API with caution;
The solutions are proposed by Spark itself, and after googling for a while I got to the Spark Migration Guide page:
In Spark 3.0, using org.apache.spark.sql.functions.udf(AnyRef, DataType) is not allowed by default. Remove the return type parameter to automatically switch to typed Scala udf is recommended, or set spark.sql.legacy.allowUntypedScalaUDF to true to keep using it. In Spark version 2.4 and below, if org.apache.spark.sql.functions.udf(AnyRef, DataType) gets a Scala closure with primitive-type argument, the returned UDF returns null if the input values is null. However, in Spark 3.0, the UDF returns the default value of the Java type if the input value is null. For example, val f = udf((x: Int) => x, IntegerType), f($"x") returns null in Spark 2.4 and below if column x is null, and return 0 in Spark 3.0. This behavior change is introduced because Spark 3.0 is built with Scala 2.12 by default.
source: Spark Migration Guide
I notice that my usual way of using the functions.udf API, which is udf(AnyRef, DataType), is called an UnTyped Scala UDF, and the proposed solution, which is udf(AnyRef), is called a Typed Scala UDF.
To my understanding, the first one looks more strictly typed than the second one, since the first has its output type explicitly defined and the second does not, hence my confusion about why it's called UnTyped.
Also, the function passed to udf, which is (x: Int) => x, clearly has its input type defined, yet Spark claims You're using untyped Scala UDF, which does not have the input type information?
Is my understanding correct? Even after more intensive searching I still can't find any material explaining what is UnTyped Scala UDF and what is Typed Scala UDF.
So my questions are: What are they? What are their differences?
With a typed Scala UDF, the UDF knows the types of the columns passed as arguments, whereas with an untyped Scala UDF, it doesn't.
When creating a typed Scala UDF, the argument and output types of the UDF are inferred from the function's argument and return types, whereas when creating an untyped Scala UDF there is no type inference at all, either for arguments or output.
What can be confusing is that when creating a typed UDF the types are inferred from the function and not explicitly passed as arguments. To be more explicit, you can write the typed UDF creation as follows:
val my_typed_udf = udf[Int, Int]((x: Int) => x)
Now, let's look at the two points you raised.
To my understanding, the first one (eg udf(AnyRef, DataType)) looks more strictly typed than the second one (eg udf(AnyRef)) where the first one has its output type explicitly defined and the second one does not, hence my confusion on why it's called UnTyped.
According to the Spark functions Scaladoc, the signatures of the udf functions that transform a function into a UDF are actually, for the first one:
def udf(f: AnyRef, dataType: DataType): UserDefinedFunction
And for the second one:
def udf[RT: TypeTag, A1: TypeTag](f: Function1[A1, RT]): UserDefinedFunction
So the second one is actually more typed than the first one, as the second one takes into account the type of the function passed as argument, whereas the first one erases the type of the function.
That's why with the first one you need to define the return type: Spark needs this information but can't infer it from the function passed as argument, since the function's type is erased, whereas with the second one the return type is inferred from the function passed as argument.
Also the function got passed to udf, which is (x:Int) => x, clearly has its input type defined but Spark claiming You're using untyped Scala UDF, which does not have the input type information?
What is important here is not the function itself, but how Spark creates a UDF from this function.
In both cases, the function to be turned into a UDF has its input and return types defined, but those types are erased and not taken into account when the UDF is created with udf(AnyRef, DataType).
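To make the difference concrete, here is a hedged sketch contrasting the two APIs on Spark 3.x (the null behaviour in the comments follows the migration guide quoted above):

import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.IntegerType

// Typed: argument and return types are inferred from the closure via TypeTags.
val typedPlusOne = udf((x: Int) => x + 1)

// Untyped: no type information; on Spark 3.0+ this also requires
// spark.sql.legacy.allowUntypedScalaUDF=true.
val untypedPlusOne = udf((x: Int) => x + 1, IntegerType)

// On a nullable integer column, the typed UDF yields null for a null input,
// while the untyped one sees 0 (the Java default for Int) and returns 1.
// df.select(typedPlusOne(col("x")), untypedPlusOne(col("x")))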
This doesn't answer your original question about what the different UDFs are, but if you want to get rid of the error, in Python you can include this line in your script: spark.sql("set spark.sql.legacy.allowUntypedScalaUDF=true").

PySpark Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of schemaofjson(`col1`);

I'm trying to infer a schema from a JSON-like string with the schema_of_json function and then use that schema to parse the string value as a struct with the from_json function. My code is:
import pyspark.sql.functions as sqlf
from pyspark.sql.functions import col

dfTemp = readFromEventHubs()
df = dfTemp.withColumn("col1", sqlf.get_json_object(col("jsonString"), '$.*'))

col1Val = df.col1
jsonSchema = sqlf.schema_of_json(col1Val)
df.select(sqlf.from_json(df.col1, jsonSchema).alias("jsonCol"))
but I get the following exception:
AnalysisException: 'Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of schemaofjson(`col1Val`);'
Just one precision: I'm using Spark Streaming.
What's wrong with my code? Thank you.
schema_of_json expects a string representing a valid JSON object. You’re passing it a pyspark.sql.Column, likely because you’re hoping it will infer the schema of every single row. That won’t happen though.
from_json expects as its first positional argument a Column that contains JSON strings, and as its second argument a pyspark.sql.types.StructType or pyspark.sql.types.ArrayType, or even (since 2.3) a DDL-formatted string or a JSON format string (i.e. a specification).
That means you can't infer a different schema per row.
If you know the schema before reading (chances are that you do), then pass it in ("schema-on-read") when you call from_json. If you're not fixed on Databricks Delta, you could use a different reader, spark.read.json, leaving its keyword argument schema unspecified so that it infers the schema.
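For completeness, a hedged sketch in Scala (the thread's main language; the same functions exist in PySpark), assuming you know the payload shape up front:

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hypothetical shape of the JSON carried in col1; replace with your real fields.
val knownSchema = StructType(Seq(
  StructField("id", LongType),
  StructField("payload", StringType)
))

// schema-on-read: the schema is a driver-side value, not a per-row column.
val parsed = df.select(from_json(col("col1"), knownSchema).alias("jsonCol"))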

Scala function TypeTag: T use type T in function

I need to parse several JSON fields, which I'm using Play JSON to do. As parsing may fail, I need to throw a custom exception for each field.
To read a field, I use this:
val fieldData = parseField[String](json \ fieldName, "fieldName")
My parseField function:
def parseField[T](result: JsLookupResult, fieldName: String): T = {
  result.asOpt[T].getOrElse(throw new IllegalArgumentException(s"""Can't access $fieldName."""))
}
However, I get an error that reads:
Error:(17, 17) No Json deserializer found for type T. Try to implement
an implicit Reads or Format for this type.
result.asOpt[T].getOrElse(throw new IllegalArgumentException(s"""Can't access $fieldName."""))
Is there a way to tell the asOpt[] to use the type in T?
I strongly suggest that you do not throw exceptions. The Play JSON API has both JsSuccess and JsError types that will help you encode parsing errors.
As per the documentation
To convert a Scala object to and from JSON, we use Json.toJson[T: Writes] and Json.fromJson[T: Reads] respectively. Play JSON provides the Reads and Writes typeclasses to define how to read or write specific types. You can get these either by using Play's automatic JSON macros, or by manually defining them. You can also read JSON from a JsValue using validate, as and asOpt methods. Generally it's preferable to use validate since it returns a JsResult which may contain an error if the JSON is malformed.
See https://github.com/playframework/play-json#reading-and-writing-objects
There is also a good example on the Play Discourse forum on how the API manifests in practice.
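For illustration, a minimal sketch along those lines: adding a Reads context bound to the original helper so the compiler can find a deserializer for T, and returning a JsResult instead of throwing (the fieldName parameter is dropped because JsError already reports the path and reason):

import play.api.libs.json.{JsLookupResult, JsResult, Reads}

// The context bound T: Reads asks for an implicit Reads[T] at the call site,
// which is exactly what asOpt/validate need to deserialize T.
def parseField[T: Reads](result: JsLookupResult): JsResult[T] =
  result.validate[T]

// usage: parseField[String](json \ "fieldName")
// returns JsSuccess("...") or a JsError describing what went wrong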

Spark SQL UDF from a string which represents scala code at runtime

I need to be able to register a UDF from a string which I will get from a web service, i.e. at run time I call a web service to get the Scala code which constitutes the UDF, compile it and register it as a UDF in the Spark context. As an example, let's say my web service returns the following Scala code in a JSON response:
(row: Row, field: String) => {
  import scala.util.{Try, Success, Failure}
  val index: Int = Try(row.fieldIndex(field)) match {
    case Success(_) => 1
    case Failure(_) => 0
  }
  index
}
I want to compile this code on the fly and then register it as a UDF. I have already tried multiple options, such as using the toolbox or Twitter's Eval util, but found that I need to explicitly specify the argument types of the method while creating an instance, for example:
import scala.reflect.runtime.currentMirror
import scala.reflect.runtime.universe._
import scala.tools.reflect.ToolBox

val toolBox = currentMirror.mkToolBox()

val code =
  q"""
    (a: String, b: String) => {
      a + b
    }
  """
val compiledCode = toolBox.compile(code)
val compiledFunc = compiledCode().asInstanceOf[(String, String) => String]
This UDF takes two strings as arguments, hence I need to specify the types while creating the object, like
compiledCode().asInstanceOf[(String, String) => String]
The other option I explored is
https://stackoverflow.com/a/34371343/1218856
In both cases I have to know the number of arguments, the argument types and the return type beforehand to instantiate the code as a method. But in my case, as the UDFs are created by my users, I have no control over the number of arguments and their types, so I would like to know if there is any way I can register the UDF by compiling the Scala code without knowing the argument number and type information.
In a nutshell, I get the code as a string, compile it, and register it as a UDF without knowing the type information.
I think you'd be much better off not trying to generate/execute code directly but defining a different kind of expression language and executing that. Something like ANTLR could help you with writing the grammar of that expression language and generating the parser and the Abstract Syntax Trees. Or even Scala's parser combinators. It's of course more work, but also a far less risky and error-prone way of allowing custom function execution.
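To make the idea concrete, here is a minimal sketch (all names are hypothetical, not an existing library): users submit a small, restricted expression that is parsed into an AST and evaluated by a function with a fixed signature, so it can be wrapped in a UDF once, regardless of what the user wrote:

import org.apache.spark.sql.Row

// Hypothetical tiny expression language: just enough to express
// "does this row have a field?" checks like the example in the question.
sealed trait Expr
case class HasField(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr

def eval(expr: Expr, row: Row): Boolean = expr match {
  case HasField(name)   => row.schema.fieldNames.contains(name)
  case And(left, right) => eval(left, row) && eval(right, row)
}

// The evaluator's signature never changes, so the types are known up front;
// only the Expr value (parsed from the user's text) varies at run time.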