How to define schema of streaming dataset dynamically to write to csv? - scala

I have a streaming dataset that reads from Kafka, and I am trying to write it to CSV:
case class Event(map: Map[String,String])
def decodeEvent(arrByte: Array[Byte]): Event = ...//some implementation
val eventDataset: Dataset[Event] = spark
.readStream
.format("kafka")
.load()
.select("value")
.as[Array[Byte]]
.map(decodeEvent)
Event holds a Map[String,String], so to write it to CSV I need a schema.
Assume all the fields are of type String. I tried the example from the Spark repo:
val columns = List("year","month","date","topic","field1","field2")
// Prepare the schema programmatically. StructType is immutable, so build it in one expression.
val schema = StructType(columns.map(field => StructField(field, StringType, nullable = true)))
val rowRdd = eventDataset.rdd.map { event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
}
val df = spark.sqlContext.createDataFrame(rowRdd, schema)
This gives error at runtime on line "eventDataset.rdd":
Caused by: org.apache.spark.sql.AnalysisException: Queries with
streaming sources must be executed with writeStream.start();;
The following doesn't work either, because the function passed to .map returns a List[String] rather than a tuple:
eventDataset.map(event => columns.map(c => event.map.getOrElse(c, "")))
  .toDF(columns: _*)
Is there a way to achieve this with programmatic schema and structured streaming datasets?

I'd use a much simpler approach:
import org.apache.spark.sql.functions._
eventDataset.select(columns.map(
c => coalesce($"map".getItem(c), lit("")).alias(c)
): _*).writeStream.format("csv").start(path)
But if you want something closer to your current solution, skip the RDD conversion and map the Dataset directly with an explicit Row encoder:
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.encoders.RowEncoder
eventDataset.map(event =>
  Row.fromSeq(columns.map(c => event.map.getOrElse(c, "")))
)(RowEncoder(schema)).writeStream.format("csv").start(path)
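To check the first approach without running a stream, the same projection can be tried on a static Dataset. This is a minimal sketch, assuming a local SparkSession; the sample events and the shortened column list are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

case class Event(map: Map[String, String])

// Assumption: a local session is good enough for a quick check.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("map-projection-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical column list, shorter than the one in the question.
val columns = List("year", "month", "topic", "field1")

// Static stand-in for the streaming Dataset[Event].
val events = Seq(
  Event(Map("year" -> "2018", "month" -> "02", "topic" -> "clicks")),
  Event(Map("year" -> "2018", "field1" -> "x"))
).toDS()

// One output column per key; keys missing from the map become "".
val flat = events.select(
  columns.map(c => coalesce(col("map").getItem(c), lit("")).alias(c)): _*
)

val rows = flat.collect().map(_.toSeq.toList).toList
```

The select expression is the same for a streaming Dataset; only the source (readStream) and the sink (writeStream) change.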

Related

Infer Schema from rdd to Dataframe in Spark Scala

This question refers to (Spark - creating schema programmatically with different data types).
I am trying to infer a schema when converting an RDD to a DataFrame. Below is my code:
def inferType(field: String) = field.split(":")(1) match {
case "Integer" => IntegerType
case "Double" => DoubleType
case "String" => StringType
case "Timestamp" => TimestampType
case "Date" => DateType
case "Long" => LongType
case _ => StringType
}
val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"
val df1 = Seq(("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01")).toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))
val schema = StructType(header.split("\\|").map(column => StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()
When I call show, it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution, or another way to achieve this?
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int
Thanks in advance
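Short answer: Spark does not convert String values to the types declared in the schema; every value in a Row must already have the type its column declares.
What you are trying to do is parse a string into SQL columns. The difference from the other example is that it loads from CSV, where the reader does the conversion, whereas you are building the rows from a raw string yourself.
A working version can be achieved like this: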
// skipped other details such as schematype, spark session...
val header = "c1:String|c2:String|c3:Double|c4:Integer"
// Create `Row` from `Seq`
val row = Row.fromSeq(Seq("a|b|44.44|12|"))
// Create `RDD` from `Row`
val rdd: RDD[Row] = spark.sparkContext
.makeRDD(List(row))
.map { row =>
row.getString(0).split("\\|") match {
case Array(col1, col2, col3, col4) =>
Row(col1, col2, col3.toDouble, col4.toInt)
}
}
val stt: StructType = StructType(
header
.split("\\|")
.map(column => StructField(column.split(":")(0), inferType(column), true))
)
val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
The reason for creating the Row from typed Scala values is that each value must be compatible with the type the schema declares for its column.
Note that I skipped the date- and time-related fields, because date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here
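The same idea can be generalized so that the conversion is driven by the header instead of a hand-written pattern match. The sketch below assumes that timestamps and dates arrive in the formats accepted by java.sql.Timestamp.valueOf and java.sql.Date.valueOf (yyyy-MM-dd HH:mm:ss and yyyy-MM-dd). The helpers typeOf and convert are hypothetical names, and the header is a shortened version of the one in the question.

```scala
import java.sql.{Date, Timestamp}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("typed-rows-sketch")
  .getOrCreate()

val header = "c1:String|c3:Double|c4:Integer|c6:Timestamp|c7:Long|c8:Date"
val line = "a|44.44|5|2018-01-01 01:00:00|456|2018-01-01"

// Map the type name in the header to a Spark SQL type.
def typeOf(typeName: String): DataType = typeName match {
  case "Integer"   => IntegerType
  case "Double"    => DoubleType
  case "Timestamp" => TimestampType
  case "Date"      => DateType
  case "Long"      => LongType
  case _           => StringType
}

// Convert the raw text to the external JVM type Spark expects for that SQL type.
def convert(raw: String, dataType: DataType): Any = dataType match {
  case IntegerType   => raw.toInt
  case DoubleType    => raw.toDouble
  case LongType      => raw.toLong
  case TimestampType => Timestamp.valueOf(raw)
  case DateType      => Date.valueOf(raw)
  case _             => raw
}

val schema = StructType(header.split("\\|").map { column =>
  val Array(name, typeName) = column.split(":")
  StructField(name, typeOf(typeName), nullable = true)
})

// Pair each raw value with its field and convert it before building the Row.
val values = line.split("\\|").zip(schema.fields).map {
  case (raw, field) => convert(raw, field.dataType)
}

val rdd = spark.sparkContext.parallelize(Seq(Row.fromSeq(values.toSeq)))
val df = spark.createDataFrame(rdd, schema)
val first = df.head()
```

A malformed value makes convert throw, so real input would probably need a Try around the conversion or a filtering step before it.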

How to convert RDD[GenericRecord] to dataframe in scala?

I get tweets from a Kafka topic with Avro (serializer and deserializer).
Then I create a Spark consumer which extracts the tweets into a DStream of RDD[GenericRecord].
Now I want to convert each RDD to a DataFrame so I can analyse these tweets via SQL.
Is there any way to convert an RDD[GenericRecord] to a DataFrame?
I spent some time trying to make this work (especially how to deserialize the data properly, but it looks like you already cover this) ... UPDATED
//Define function to convert from GenericRecord to Row
def genericRecordToRow(record: GenericRecord, sqlType : SchemaConverters.SchemaType): Row = {
val objectArray = new Array[Any](record.asInstanceOf[GenericRecord].getSchema.getFields.size)
import scala.collection.JavaConversions._
for (field <- record.getSchema.getFields) {
objectArray(field.pos) = record.get(field.pos)
}
new GenericRowWithSchema(objectArray, sqlType.dataType.asInstanceOf[StructType])
}
//Inside your stream foreachRDD
val yourGenericRecordRDD = ...
val schema = new Schema.Parser().parse(...) // your schema
val sqlType = SchemaConverters.toSqlType(new Schema.Parser().parse(strSchema))
var rowRDD = yourGenericRecordRDD.map(record => genericRecordToRow(record, sqlType))
val df = sqlContext.createDataFrame(rowRDD , sqlType.dataType.asInstanceOf[StructType])
As you can see, I am using SchemaConverters to get the DataFrame structure from the schema that you used to deserialize (this could be more painful with a schema registry). For this you need the following dependency:
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-avro_2.11</artifactId>
<version>3.2.0</version>
</dependency>
You will need to change the version to match your Spark version.
UPDATE: the code above only works for flat avro schemas.
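One thing to watch for with the flat conversion: Avro's generic reader normally returns string fields as org.apache.avro.util.Utf8 rather than java.lang.String, while createDataFrame expects a String for a StringType column. Below is a minimal sketch of a flat converter that normalizes such values. The function flatRecordToRow and the Tweet schema are hypothetical, and the sketch assumes the Avro and Spark SQL libraries are on the classpath.

```scala
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.util.Utf8
import org.apache.spark.sql.Row
import scala.collection.JavaConverters._

// Copy the fields of a flat record into a Row, turning Avro character
// sequences (such as Utf8) into plain Strings. Nested records are not handled.
def flatRecordToRow(record: GenericRecord): Row = {
  val values = record.getSchema.getFields.asScala.map { field =>
    record.get(field.pos) match {
      case cs: CharSequence => cs.toString
      case other            => other
    }
  }
  Row.fromSeq(values.toSeq)
}

// Hypothetical two-field schema, used only to exercise the converter.
val tweetSchema: Schema = SchemaBuilder.record("Tweet").fields()
  .requiredString("user")
  .requiredLong("id")
  .endRecord()

val record = new GenericData.Record(tweetSchema)
record.put("user", new Utf8("alice"))
record.put("id", Long.box(42L))

val row = flatRecordToRow(record)
```

The resulting rows can then be passed to createDataFrame together with the StructType obtained from SchemaConverters.toSqlType, as in the code above.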
For nested structures I used something different. You can copy the class SchemaConverters, it has to be inside of com.databricks.spark.avro (it uses some protected classes from the databricks package) or you can try to use the spark-bigquery dependency. The class will not be accessible by default, so you will need to create a class inside a package com.databricks.spark.avro to access the factory method.
package com.databricks.spark.avro
import com.databricks.spark.avro.SchemaConverters.createConverterToSQL
import org.apache.avro.Schema
import org.apache.spark.sql.types.StructType
object SchemaConverterUtils {
def converterSql(schema : Schema, sqlType : StructType) = {
createConverterToSQL(schema, sqlType)
}
}
After that you should be able to convert the data like
val schema = .. // your schema
val sqlType = SchemaConverters.toSqlType(schema).dataType.asInstanceOf[StructType]
....
//inside foreach RDD
var genericRecordRDD = deserializeAvroData(rdd)
///
var converter = SchemaConverterUtils.converterSql(schema, sqlType)
...
val rowRdd = genericRecordRDD.flatMap(record => {
Try(converter(record).asInstanceOf[Row]).toOption
})
//To DataFrame
val df = sqlContext.createDataFrame(rowRdd, sqlType)
A combination of https://stackoverflow.com/a/48828303/5957143 and https://stackoverflow.com/a/47267060/5957143 works for me.
I used the following to create MySchemaConversions
package com.databricks.spark.avro
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.DataType
object MySchemaConversions {
def createConverterToSQL(avroSchema: Schema, sparkSchema: DataType): (GenericRecord) => Row =
SchemaConverters.createConverterToSQL(avroSchema, sparkSchema).asInstanceOf[(GenericRecord) => Row]
}
And then I used
val myAvroType = SchemaConverters.toSqlType(schema).dataType
val myAvroRecordConverter = MySchemaConversions.createConverterToSQL(schema, myAvroType)
// unionedResultRdd is unionRDD[GenericRecord]
var rowRDD = unionedResultRdd.map(record => MyObject.myConverter(record, myAvroRecordConverter))
val df = sparkSession.createDataFrame(rowRDD , myAvroType.asInstanceOf[StructType])
The advantage of having myConverter in the object MyObject is that you will not encounter serialization issues (java.io.NotSerializableException).
object MyObject{
def myConverter(record: GenericRecord,
myAvroRecordConverter: (GenericRecord) => Row): Row =
myAvroRecordConverter.apply(record)
}
Even though something like this may help you,
val stream = ...
val dfStream = stream.transform((rdd: RDD[GenericRecord]) => {
val df = rdd.map(_.toSeq)
.map(seq=> Row.fromSeq(seq))
.toDF(col1,col2, ....)
df
})
I'd like to suggest an alternative approach. With Spark 2.x you can skip the whole process of creating DStreams. Instead, you can do something like this with Structured Streaming:
val df = ss.readStream
.format("com.databricks.spark.avro")
.load("/path/to/files")
This will give you a single dataframe which you can directly query. Here, ss is the instance of spark session. /path/to/files is the place where all your avro files are being dumped from kafka.
PS: You may need to import spark-avro
libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
Hope this helped. Cheers
You can use createDataFrame(rowRDD: RDD[Row], schema: StructType), which is available in the SQLContext object. Example for converting an RDD of an old DataFrame:
import sqlContext.implicits._
val rdd = oldDF.rdd
val newDF = oldDF.sqlContext.createDataFrame(rdd, oldDF.schema)
Note that there is no need to explicitly set any schema column. We reuse the old DF's schema, which is of StructType class and can be easily extended. However, this approach sometimes is not possible, and in some cases can be less efficient than the first one.

How to convert RDD[Row] to RDD[String]

I have a DataFrame called source, a table from mysql
val source = sqlContext.read.jdbc(jdbcUrl, "source", connectionProperties)
I have converted it to rdd by
val sourceRdd = source.rdd
but it is an RDD[Row] and I need an RDD[String]
to do transformations like
source.map(rec => (rec.split(",")(0).toInt, rec)), .subtractByKey(), etc.
Thank you
You can use the Row.mkString(sep: String): String method in a map call like this:
val sourceRdd = source.rdd.map(_.mkString(","))
You can replace the "," separator with whatever you want.
Hope this helps. Best regards.
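To see what mkString produces for a single row, here is a small sketch with a made-up Row. It only needs the Spark SQL library on the classpath, not a running session.

```scala
import org.apache.spark.sql.Row

// Hypothetical row standing in for one record of the source table.
val row = Row(1, "alice", 3.5)

// Join the values with the chosen separator.
val line = row.mkString(",")

// The keyed pair the question wants to build from each line.
val keyed = (line.split(",")(0).toInt, line)
```

Note that mkString does no escaping, so a value that itself contains the separator will break a later split, and null values are rendered as the text "null".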
What is your schema?
If it's just a String, you can use:
import spark.implicits._
val sourceDS = source.as[String]
val sourceRdd = sourceDS.rdd // will give RDD[String]
Note: use sqlContext instead of spark in Spark 1.6 - spark is a SparkSession, which is a new class in Spark 2.0 and is a new entry point to SQL functionality. It should be used instead of SQLContext in Spark 2.x
You can also create own case classes.
Also you can map rows - here source is of type DataFrame, we use partial function in map function:
val sourceRdd = source.rdd.map { case x : Row => x(0).asInstanceOf[String] }.map(s => s.split(","))

How to read records in JSON format from Kafka using Structured Streaming?

I am trying to use the Structured Streaming approach, based on the DataFrame/Dataset API, to load a stream of data from Kafka.
I use:
Spark 2.10
Kafka 0.10
spark-sql-kafka-0-10
The Spark Kafka data source has a fixed underlying schema:
|key|value|topic|partition|offset|timestamp|timestampType|
My data comes in JSON format and is stored in the value column. I am looking for a way to extract the underlying schema from the value column and expand the received DataFrame into the columns stored in value. I tried the approach below, but it does not work:
val columns = Array("column1", "column2") // column names
val rawKafkaDF = sparkSession.sqlContext.readStream
.format("kafka")
.option("kafka.bootstrap.servers","localhost:9092")
.option("subscribe",topic)
.load()
val columnsToSelect = columns.map( x => new Column("value." + x))
val kafkaDF = rawKafkaDF.select(columnsToSelect:_*)
// some analytics using stream dataframe kafkaDF
val query = kafkaDF.writeStream.format("console").start()
query.awaitTermination()
Here I get the exception org.apache.spark.sql.AnalysisException: Can't extract value from value#337; because at the time the stream is created, the values inside are not known...
Do you have any suggestions?
From the Spark perspective, value is just a byte sequence. Spark has no knowledge of the serialization format or content. To be able to extract a field, you have to parse it first.
If data is serialized as a JSON string you have two options. You can cast value to StringType and use from_json and provide a schema:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions.from_json
val schema: StructType = StructType(Seq(
StructField("column1", ???),
StructField("column2", ???)
))
rawKafkaDF.select(from_json($"value".cast(StringType), schema))
or cast to StringType and extract fields by path using get_json_object:
import org.apache.spark.sql.functions.get_json_object
val columns: Seq[String] = ???
val exprs = columns.map(c => get_json_object($"value".cast("string"), s"$$.$c").alias(c))
rawKafkaDF.select(exprs: _*)
and cast later to the desired types.
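The from_json variant can be tried on a static DataFrame before wiring it to Kafka. This is a sketch under stated assumptions: the field types (string and integer) are invented here because the answer leaves them open, and the binary value column is simulated with a local Seq of byte arrays.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("from-json-sketch")
  .getOrCreate()
import spark.implicits._

// Assumed types; replace them with the real types of your JSON fields.
val schema = StructType(Seq(
  StructField("column1", StringType),
  StructField("column2", IntegerType)
))

// Stand-in for the Kafka source: a single binary "value" column.
val rawDF = Seq("""{"column1":"a","column2":1}""".getBytes("UTF-8")).toDF("value")

// Parse the JSON into a struct, then expand the struct into top-level columns.
val parsed = rawDF
  .select(from_json(col("value").cast("string"), schema).alias("data"))
  .select("data.*")

val first = parsed.head()
```

The same two select calls should apply unchanged to the streaming DataFrame returned by readStream.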

How to convert a case-class-based RDD into a DataFrame?

The Spark documentation shows how to create a DataFrame from an RDD, using Scala case classes to infer a schema. I am trying to reproduce this concept using sqlContext.createDataFrame(RDD, CaseClass), but my DataFrame ends up empty. Here's my Scala code:
// sc is the SparkContext, while sqlContext is the SQLContext.
// Define the case class and raw data
case class Dog(name: String)
val data = Array(
Dog("Rex"),
Dog("Fido")
)
// Create an RDD from the raw data
val dogRDD = sc.parallelize(data)
// Print the RDD for debugging (this works, shows 2 dogs)
dogRDD.collect().foreach(println)
// Create a DataFrame from the RDD
val dogDF = sqlContext.createDataFrame(dogRDD, classOf[Dog])
// Print the DataFrame for debugging (this fails, shows 0 dogs)
dogDF.show()
The output I'm seeing is:
Dog(Rex)
Dog(Fido)
++
||
++
||
||
++
What am I missing?
Thanks!
All you need is just
val dogDF = sqlContext.createDataFrame(dogRDD)
The second parameter is part of the Java API and expects your class to follow the JavaBeans convention (getters/setters). Your case class doesn't follow this convention, so no properties are detected, which leads to an empty DataFrame with no columns.
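Below is a minimal sketch of the single-argument call, assuming a local SparkSession in place of the sc and sqlContext pair used in the question.

```scala
import org.apache.spark.sql.SparkSession

case class Dog(name: String)

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("case-class-df-sketch")
  .getOrCreate()

// RDD of case class instances, as in the question.
val dogRDD = spark.sparkContext.parallelize(Seq(Dog("Rex"), Dog("Fido")))

// With no second argument, the schema is derived from the case class fields.
val dogDF = spark.createDataFrame(dogRDD)

val names = dogDF.collect().map(_.getString(0)).toList
```

Here the schema comes from Scala reflection on the Product type, so the name field becomes a column without any getters or setters.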
You can create a DataFrame directly from a Seq of case class instances using toDF as follows:
val dogDf = Seq(Dog("Rex"), Dog("Fido")).toDF
The case class approach won't work in cluster mode. It will throw a ClassNotFoundException for the case class you defined.
Convert it to an RDD[Row], define the schema of your RDD with StructField, and then call createDataFrame like this:
val rdd = data.map { attrs => Row(attrs(0),attrs(1)) }
val rddStruct = new StructType(Array(StructField("id", StringType, nullable = true),StructField("pos", StringType, nullable = true)))
sqlContext.createDataFrame(rdd,rddStruct)
toDF() won't work either.