How to create dataframe from JsObject for writing to S3 - scala

I am running my job on AWS Glue 2 using a Scala script.
I have a dataframe input_df which is created by reading JSON from S3.
I then pass each record of it to SomeMethodThatReturnsAJsObject, which returns a JsObject. SomeMethodThatReturnsAJsObject seems to return a proper JsObject response.
I want to write this to a different S3 bucket, s3://destination/.
When I try to create a DataFrame from the output of this method, I get this error in the Glue logs:
Exception in User Class: java.lang.UnsupportedOperationException : No Encoder found for play.api.libs.json.JsValue
object GlueJob extends TransformerHooks {
  /* Some additional code */
  implicit val session: scala.slick.driver.MySQLDriver.simple.Session = null
  import sparkSession.implicits._

  val result_df: DataFrame = input_df.map(input_json => {
    val targetData: play.api.libs.json.JsObject = SomeMethodThatReturnsAJsObject(input_json)
    targetData
  }).toDF()

  result_df.write.format("org.apache.spark.sql.json_metadata").mode(SaveMode.Append).save("s3://destination/")
  sparkSession.sparkContext.stop()
}
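A minimal sketch of one possible way around the missing encoder (not from the original post): Spark does provide an Encoder[String], so the JsObject can be serialized with Json.stringify, parsed back into a DataFrame with spark.read.json, and then written out. The built-in json format is assumed here in place of the custom org.apache.spark.sql.json_metadata format; input_df and SomeMethodThatReturnsAJsObject are the names from the question.

import play.api.libs.json.{JsObject, Json}
import org.apache.spark.sql.{DataFrame, SaveMode}
import sparkSession.implicits._

// Map each record to its JSON string; Encoder[String] is available out of the box.
val jsonStrings = input_df.map { input_json =>
  Json.stringify(SomeMethodThatReturnsAJsObject(input_json))
}

// Parse the strings back into a DataFrame and append it to the destination bucket.
val result_df: DataFrame = sparkSession.read.json(jsonStrings)
result_df.write.mode(SaveMode.Append).json("s3://destination/")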

Related

Unable to convert RDD[Java Class] to Dataframe in spark scala

I have an Avro message and a .avsc file. I have generated the Java class from the .avsc file. Now I want to convert the Avro (JSON) message into a data frame. I read the message and successfully decoded it, and I got an RDD[Product], but I am unable to convert RDD[Product] into a dataframe. I need to save the message in .avro format.
def main(args: Array[String]): Unit = {
  val conf = new SparkConf().setAppName("test").setMaster("local[*]")
  val spark = SparkSession.builder().config(conf).getOrCreate()
  import spark.implicits._

  val rdd = spark.read.textFile("/Users/lucy/product_avro.json").rdd
  val rdd1 = rdd.map(string => toProduct(string))
  spark.createDataFrame(rdd1, classOf[Product]) // not working
}

def toProduct(input: String): Product = {
  new SpecificDatumReader[Product](Product.SCHEMA$)
    .read(null, DecoderFactory.get().jsonDecoder(Product.SCHEMA$, input))
}
Error: java.lang.UnsupportedOperationException: Cannot have circular references in bean class, but got the circular reference of class org.apache.avro.Schema
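One possible workaround, sketched here rather than taken from the original post: instead of asking Spark to reflect over the Avro-generated bean (whose getSchema getter causes the circular reference), copy the fields you need into a plain case class and build the dataframe from that. The fields name and price below are hypothetical; use the fields from your .avsc.

// Hypothetical flat projection of the Avro-generated Product class.
case class ProductRow(name: String, price: Double)

import spark.implicits._

val df = rdd
  .map(string => toProduct(string))
  .map(p => ProductRow(p.getName.toString, p.getPrice)) // copy plain values out of the Avro record
  .toDF()

// The dataframe can then be written back out as Avro, for example with spark-avro:
// df.write.format("com.databricks.spark.avro").save("/path/to/output")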

Apache Spark - Generic method for loading csv data to dataset

I would like to write a generic method with three input parameters:
filePath - String
schema - ?
case class
So, my idea is to write a method like this:
def load_sms_ds(filePath: String, schemaInfo: ?, cc: ?) = {
  val ds = spark.read
    .format("csv")
    .option("header", "true")
    .schema(?)
    .option("delimiter", ",")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
    .load(filePath)
    .as[?]
  ds
}
and to return a dataset depending on the input parameters. I am not sure, though, what types the parameters schemaInfo and cc should be.
First of all, I would recommend reading the Spark SQL programming guide. It contains some things that I think will generally help you as you learn Spark.
Let's run through the process of reading in a CSV file using a case class to define the schema.
First, add the various imports needed for this example:
import java.io.{File, PrintWriter} // for reading / writing the example data
import org.apache.spark.sql.types.{StringType, StructField, StructType} // to define the schema
import org.apache.spark.sql.catalyst.ScalaReflection // used to generate the schema from a case class
import scala.reflect.runtime.universe.TypeTag // used to provide type information of the case class at runtime
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.Encoder // needed by the .as[T] call below
Define a case class; the different column types available are listed in the Spark SQL data types documentation:
case class Example(
  stringField: String,
  intField: Int,
  doubleField: Double
)
Add the method for extracting a schema (StructType) given a case class type as a parameter:
// T : TypeTag means that an implicit value of type TypeTag[T] must be available at the
// method call site. Scala will automatically generate this for you; see the Scala
// documentation on TypeTags for further details.
def schemaOf[T: TypeTag]: StructType = {
  ScalaReflection
    .schemaFor[T] // this method requires a TypeTag for T
    .dataType
    .asInstanceOf[StructType] // cast it to a StructType, which is what spark requires as its schema
}
Define a method to read in a CSV file from a path, with the schema defined using a case class:
// The implicit Encoder is needed by the `.as` method in order to create the Dataset[T].
// The TypeTag is required by the schemaOf[T] call.
def readCSV[T : Encoder : TypeTag](
  filePath: String
)(implicit spark: SparkSession): Dataset[T] = {
  spark.read
    .option("header", "true")
    .option("dateFormat", "yyyy-MM-dd HH:mm:ss.SSS")
    .schema(schemaOf[T])
    .csv(filePath) // spark provides this more explicit call to read from a csv file; by default it uses a comma as the separator, but this can be changed
    .as[T]
}
Create a SparkSession:
implicit val spark = SparkSession.builder().master("local").getOrCreate()
Write some sample data to a temp file:
val data =
  s"""|stringField,intField,doubleField
      |hello,1,1.0
      |world,2,2.0
      |""".stripMargin

val file = File.createTempFile("test", ".csv")
val pw = new PrintWriter(file)
pw.write(data)
pw.close()
An example of calling this method:
import spark.implicits._ // so that an implicit Encoder gets pulled in for the case class
val df = readCSV[Example](file.getPath)
df.show()

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from a Kafka topic with Avro (serializer and deserializer).
Then I create a Spark consumer which extracts tweets in a DStream of RDD[GenericRecord].
Now I want to convert each RDD to a dataframe to analyse these tweets via SQL.
Any solution to convert RDD[GenericRecord] to a dataframe, please?
I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you already have that covered). See the UPDATE below.
// Define a function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType: SchemaConverters.SchemaType): Row = {
  val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
  import scala.collection.JavaConversions._
  for (field <- record.getSchema.getFields) {
    objectArray(field.pos) = record.get(field.pos)
  }
  new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}
// Inside your stream foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(schema)
val rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD, sqlType.dataType.asInstanceOf[StructType])
As you can see, I am using a SchemaConverter to get the dataframe structure from the schema that you used to deserialize (this could be more painful with a schema registry). For this you need the following dependency:
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-avro_2.11</artifactId>
  <version>3.2.0</version>
</dependency>
You will need to adjust the artifact version to match your Spark version.
UPDATE: the code above only works for flat Avro schemas.
For nested structures I used something different. You can copy the class SchemaConverters (it uses some protected members of the databricks package, so your copy has to live inside com.databricks.spark.avro), or you can try the spark-bigquery dependency. Either way, the factory method is not accessible by default, so you will need to create a class inside the package com.databricks.spark.avro to access it.
package com.databricks.spark.avro

import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType

object SchemaConverterUtils {
  def converterSql(schema: Schema, sqlType: StructType) =
    createConverterToSQL(schema, sqlType)
}
After that you should be able to convert the data like
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
// inside foreachRDD
var genericRecordRDD = deserializeAvroData(rdd)
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
  Try(converter(record).asInstanceOf[Row]).toOption
})
// To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro

import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType

object MySchemaConversions {
  def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
    SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)

// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD, myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject {
  def myConverter(record: GenericRecord,
                  myAvroRecordConverter: (GenericRecord) => Row): Row =
    myAvroRecordConverter.apply(record)
}
Something like this may also help you:
val stream = ...
val dfStream = stream.transform { rdd: RDD[GenericRecord] =>
  val df = rdd.map(_.toSeq)
    .map(seq => Row.fromSeq(seq))
    .toDF(col1, col2, ....)
  df
}
I'd like to suggest an alternative approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with Structured Streaming:
val df = ss.readStream
  .format("com.databricks.spark.avro")
  .load("/path/to/files")
This will give you a single dataframe which you can query directly. Here, ss is the SparkSession instance, and /path/to/files is the directory where all your Avro files are being dumped from Kafka.
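For illustration, a minimal sketch of querying that streaming dataframe via SQL (the column name text is a hypothetical field of the tweet records; streaming results must be emitted through writeStream rather than show()):

// Register the streaming dataframe so it can be queried with SQL.
df.createOrReplaceTempView("tweets")

// "text" is an assumed column; replace it with a real field from your Avro schema.
val counts = ss.sql("SELECT text, COUNT(*) AS n FROM tweets GROUP BY text")

// Stream the aggregated results to the console.
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()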
PS: You may need to import spark-avro
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example of converting the RDD of an old DataFrame:
import sqlContext.implicits._
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of the StructType class and can be easily extended. However, this approach is sometimes not possible, and in some cases it can be less efficient than the first one.
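For illustration, a small sketch of what extending that schema can look like (the column name extra_col is made up):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

// StructType.add returns a new schema with the extra field appended; the original is untouched.
val extendedSchema: StructType = oldDF.schema.add(StructField("extra_col", StringType, nullable = true))

// extendedSchema can then be passed to createDataFrame, as long as each Row also
// carries a value for the new column.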

How to define schema of streaming dataset dynamically to write to csv?

I have a streaming dataset, read from Kafka, and I am trying to write it to CSV.
case class Event(map: Map[String, String])

def decodeEvent(arrByte: Array[Byte]): Event = ... // some implementation

val eventDataset: Dataset[Event] = spark
  .readStream
  .format("kafka")
  .load()
  .select("value")
  .as[Array[Byte]]
  .map(decodeEvent)
Event holds a Map[String,String] inside, and to write to CSV I'll need some schema.
Let's say all the fields are of type String, so I tried the example from the Spark repo:
val columns = List("year", "month", "date", "topic", "field1", "field2")
val schema = new StructType() // Prepare schema programmatically
columns.foreach { field => schema.add(field, "string") }

val rowRdd = eventDataset.rdd.map { event => Row.fromSeq(
  columns.map(c => event.getOrElse(c, ""))
)}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives an error at runtime on the line "eventDataset.rdd":
Caused by: org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;
The below doesn't work either, because '.map' produces a List[String], not a Tuple:
eventDataset.map(event => columns.map(c => event.getOrElse(c, "")))
  .toDF(columns: _*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?
I'd use a much simpler approach:
import org.apache.spark.sql.functions._

eventDataset.select(columns.map(
  c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
but if you want something closer to your current solution, skip the RDD conversion:
import org.apache.spark.sql.catalyst.encoders.RowEncoder

eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)

Create SparkSQL UDF with non serializable objects

I'm trying to write a UDF that I would like to use on Hive tables in a sqlContext. Is it in any way possible to include objects from other libraries that are not serializable? Here's a minimal example of what does not work:
def myUDF(s: String) = {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
}
I register the function in the spark shell as a udf function:
val encoding = sqlContext.udf.register("encoder", myUDF)
If I try to run it on a table "test"
sqlContext.sql("SELECT encoder(colname) from test").show()
I get the error
org.apache.spark.SparkException: Task not serializable
object not serializable (class: sun.misc.BASE64Encoder, value: sun.misc.BASE64Encoder@4a7f9a94)
Is there a workaround for this? I tried embedding myUDF in an object and in a class but that didn't work either.
You can try defining the udf function as:
def encoder = udf((s: String) => {
  import sun.misc.BASE64Encoder
  val coder = new BASE64Encoder
  val encoded = coder.encode(s.getBytes("UTF-8"))
  encoded
})
And call the udf function as
dataframe.withColumn("encoded", encoder(col("id"))).show
Updated
As @santon has pointed out, a BASE64Encoder is instantiated for each row in the dataframe, which might lead to performance issues. The solution to that would be to create a single BASE64Encoder inside an object and call it from within the udf function.
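A minimal sketch of that idea, using java.util.Base64 as a stand-in for the internal sun.misc.BASE64Encoder (the object name Base64Holder is illustrative; dataframe and the id column are taken from the answer above):

import org.apache.spark.sql.functions.{col, udf}

// Holds a single encoder instance per JVM/executor instead of creating one per row.
object Base64Holder {
  lazy val encoder: java.util.Base64.Encoder = java.util.Base64.getEncoder
}

val encoderUdf = udf((s: String) => Base64Holder.encoder.encodeToString(s.getBytes("UTF-8")))

dataframe.withColumn("encoded", encoderUdf(col("id"))).show()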