How to write ElasticsearchSink for Spark Structured streaming - scala

I'm using Spark Structured Streaming to process high-volume data from a Kafka queue and doing some heavy ML computation, but I need to write the result to Elasticsearch.
I tried using ForeachWriter but I can't get a SparkContext inside it; the other option is probably to do an HTTP POST inside the ForeachWriter.
Right now, I am thinking of writing my own ElasticsearchSink.
Is there any documentation out there on creating a Sink for Spark Structured Streaming?

If you are using Spark 2.2+ and ES 6.x then there is an ES sink out of the box:
df
  .writeStream
  .outputMode(OutputMode.Append())
  .format("org.elasticsearch.spark.sql")
  .option("es.mapping.id", "mappingId")
  .start("index/type") // index/type
If you are using ES 5.x like I was, you need to implement an EsSink and an EsSinkProvider:
EsSinkProvider:
class EsSinkProvider extends StreamSinkProvider with DataSourceRegister {

  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    EsSink(sqlContext, parameters, partitionColumns, outputMode)
  }

  override def shortName(): String = "my-es-sink"
}
EsSink:
case class EsSink(sqlContext: SQLContext,
                  options: Map[String, String],
                  partitionColumns: Seq[String],
                  outputMode: OutputMode) extends Sink {

  override def addBatch(batchId: Long, df: DataFrame): Unit = synchronized {
    val schema = df.schema

    // this ensures that the same query plan will be used
    val rdd: RDD[String] = df.queryExecution.toRdd.mapPartitions { rows =>
      val converter = CatalystTypeConverters.createToScalaConverter(schema)
      rows.map(converter(_).asInstanceOf[Row]).map(_.getAs[String](0))
    }

    // from the org.elasticsearch.spark.rdd library
    EsSpark.saveJsonToEs(rdd, "index/type", Map("es.mapping.id" -> "mappingId"))
  }
}
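Note that this sink takes the first column of each row as a String and hands it to EsSpark.saveJsonToEs, so the streaming DataFrame is expected to already hold each document as a JSON string in its first column. A minimal sketch of preparing such a frame (the input name inputDf and the column layout are assumptions):

import org.apache.spark.sql.functions.{col, struct, to_json}

// Hypothetical preparation step: collapse all columns into a single JSON string column,
// so the sink's getAs[String](0) picks up the whole document.
val jsonDf = inputDf.select(to_json(struct(inputDf.columns.map(col): _*)).as("value"))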
And then lastly, when writing the stream, use this provider class as the format:
df
  .writeStream
  .queryName("ES-Writer")
  .outputMode(OutputMode.Append())
  .format("path.to.EsSinkProvider")
  .start()
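Because EsSinkProvider also extends DataSourceRegister, you can alternatively register it in a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file on the classpath and then refer to it by its short name. A sketch of that usage, assuming the service file is in place (the checkpoint path is hypothetical):

df
  .writeStream
  .queryName("ES-Writer")
  .outputMode(OutputMode.Append())
  .format("my-es-sink") // resolved via shortName() through Java's ServiceLoader
  .option("checkpointLocation", "/tmp/es-sink-checkpoint") // hypothetical path
  .start()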

You can take a look at ForeachSink. It shows how to implement a Sink and convert a DataFrame to an RDD (it's very tricky and has a large comment). However, please be aware that the Sink API is still private and immature; it might change in the future.

Related

write into kafka topic using spark and scala

I am reading data from a Kafka topic and writing the received data back into another Kafka topic.
Below is my code:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter

// loading data from kafka
val data = spark.readStream.format("kafka")
  .option("kafka.bootstrap.servers", "*******:9092")
  .option("subscribe", "PARAMTABLE")
  .option("startingOffsets", "latest")
  .load()
// Extracting value from Json
val schema = new StructType()
  .add("PARAM_INSTANCE_ID", IntegerType)
  .add("ENTITY_ID", IntegerType)
  .add("PARAM_NAME", StringType)
  .add("VALUE", StringType)

val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")

// Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)

val query = dataDF.writeStream
  .foreach(writer)
  .outputMode("update")
  .start()
  .awaitTermination()
I am getting the below error:
<console>:47: error: not found: type KafkaSink
       val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Can someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic using the existing DataStreamWriter for Kafka, or you can create your own sink by extending the ForeachWriter class.
Without using a custom sink:
You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from a Kafka topic.
The dataframe should have at least one column named value. If you have multiple columns, you should merge them into one column and name it value (a sketch of doing that follows the snippet below). If no key column is specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
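If your dataframe has several columns, a minimal sketch of collapsing them into the required key/value shape; the id column and the checkpoint path are assumptions for illustration:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Example only: "id" is assumed to be the key column; all columns are serialized as JSON into "value".
val kafkaReady = df.select(
  col("id").cast("string").as("key"),
  to_json(struct(df.columns.map(col): _*)).as("value")
)

kafkaReady.writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "<topicName>")
  .option("checkpointLocation", "/tmp/kafka-sink-checkpoint") // the Kafka sink needs a checkpoint location
  .start()
  .awaitTermination()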
Using a custom sink:
If you want to implement your own Kafka sink, you need to create a class extending ForeachWriter, override its methods, and pass an instance of that class to the foreach() method.
// Using an anonymous class to extend ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead

  def open(partitionId: Long, version: Long): Boolean = {
    // open connection
    true
  }

  def process(record: Row): Unit = {
    // write rows to connection
  }

  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for the implemented code (scroll down and check the code under the Kafka Sink heading). I think you are referring to that page. To solve the error you need to make sure the KafkaSink class is available to your Spark code: you can put the Spark code file and the class file in the same package, and if you are running in spark-shell, paste the KafkaSink class before pasting the Spark code. A minimal sketch of such a class is included below.
Read the Structured Streaming Kafka integration guide to explore more.
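For reference, a minimal sketch of what such a KafkaSink class could look like; this is an assumption modelled on the description above, not the notebook's exact code:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Minimal sketch: one producer per partition, rows written as comma-joined String values.
class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {

  private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    producer.send(new ProducerRecord[String, String](topic, row.mkString(",")))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}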

Input data received all in lowercase on spark streaming in databricks using DataFrame

My Spark Streaming application consumes data from AWS Kinesis and is deployed in Databricks. I am using the org.apache.spark.sql.Row.mkString method to consume the data, and the whole payload is received in lowercase. The actual input has camel-case field names and values, but it is received in lowercase on consuming.
I have tried consuming from a simple Java application and receive the data in the correct form from the Kinesis queue. The issue only occurs in the Spark Streaming application using DataFrames and running in Databricks.
// scala code
val query = dataFrame
  .selectExpr("lcase(CAST(data as STRING)) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    def open(partitionId: Long, version: Long): Boolean = {
      true
    }
    def process(row: Row) = {
      logger.info("Record received in data frame is -> " + row.mkString)
      processDFStreamData(row.mkString, outputHandler, kBase, ruleEvaluator)
    }
    def close(errorOrNull: Throwable): Unit = {
    }
  })
  .start()
The expectation is that the Spark Streaming input JSON should be in the same case (camel case) as the data in Kinesis; it should not be converted to lowercase once received via the DataFrame.
Any thoughts on what might be causing this?
Fixed the issue: the lcase used in the select expression was the culprit. I updated the code as below and it worked.
val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
  .writeStream
  .foreach(new ForeachWriter[Row] {
    .........

spark structured streaming avro to avro and custom Sink

Can someone refer me to a good example or sample for writing Avro to S3 or any file system? I am using a custom Sink, but I would like to pass a properties Map through the constructor of the SinkProvider, which can then be passed on to the Sink, I guess?
Updated Code:
val query = df.mapPartitions { itr =>
  itr.map { row =>
    val rowInBytes = row.getAs[Array[Byte]]("value")
    MyUtils.deserializeAvro[GenericRecord](rowInBytes).toString
  }
}.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .outputMode(OutputMode.Append())
  .queryName("testQ")
  .trigger(ProcessingTime("10 seconds"))
  .option("checkpointLocation", "my_checkpoint_dir")
  .start()

query.awaitTermination()
Sink Provider:
class MyStreamingSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    new MyStreamingSink
  }
}
Sink:
class MyStreamingSink extends Sink with Serializable {

  final val log: Logger = LoggerFactory.getLogger(classOf[MyStreamingSink])

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // For saving as a text doc
    data.rdd.saveAsTextFile("path")
    log.warn(s"Total records processed: ${data.count()}")
    log.warn("Data saved.")
  }
}
You should be able to pass parameters to your custom sink via writeStream.option(key, value):
val writer = dataset.writeStream
  .format("com.test.MyStreamingSinkProvider")
  .outputMode(OutputMode.Append())
  .queryName("testQ")
  .trigger(ProcessingTime("10 seconds"))
  .option("key_1", "value_1")
  .option("key_2", "value_2")
  .start()
In this case, the parameters map passed to MyStreamingSinkProvider.createSink(...) will contain key_1 and key_2.
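Building on that, a minimal sketch of forwarding those options from the provider into the sink's constructor; the constructor parameter and the way the option is used inside addBatch are assumptions about your own classes:

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.Sink
import org.apache.spark.sql.sources.StreamSinkProvider
import org.apache.spark.sql.streaming.OutputMode

class MyStreamingSinkProvider extends StreamSinkProvider {
  override def createSink(sqlContext: SQLContext,
                          parameters: Map[String, String],
                          partitionColumns: Seq[String],
                          outputMode: OutputMode): Sink = {
    // parameters holds whatever was set via writeStream.option(...), e.g. key_1 / key_2
    new MyStreamingSink(parameters)
  }
}

class MyStreamingSink(options: Map[String, String]) extends Sink with Serializable {
  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // Hypothetical use of a forwarded option, e.g. a target path or bucket
    val target = options.getOrElse("key_1", "default-target")
    // ... write `data` to `target`, as in your existing addBatch ...
  }
}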

How to use update output mode with FileFormat format?

I am trying to use Spark Structured Streaming in update output mode to write to a file. I found this StructuredSessionization example and it works fine as long as the console format is configured. But if I switch the sink to a file format such as json:
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .format("json")
  .option("path", "/work/output/data")
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()
I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Data source json does not support Update output mode;
at org.apache.spark.sql.execution.datasources.DataSource.createSink(DataSource.scala:279)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:286)
at palyground.StructuredStreamingMergeSpans$.main(StructuredStreamingMergeSpans.scala:84)
at palyground.StructuredStreamingMergeSpans.main(StructuredStreamingMergeSpans.scala)
Can I use update mode with the FileFormat to write the result table to a file sink?
In the source code I found a pattern match that enforces Append mode.
You cannot write to a file in update mode using Spark Structured Streaming. You need to write a ForeachWriter for it. I have written a simple ForeachWriter here; you can modify it according to your requirements.
import java.io.{File, FileWriter}
import org.apache.commons.io.FileUtils
import org.apache.spark.sql.{ForeachWriter, Row}

val writerForText = new ForeachWriter[Row] {

  var fileWriter: FileWriter = _

  override def open(partitionId: Long, version: Long): Boolean = {
    FileUtils.forceMkdir(new File(s"src/test/resources/${partitionId}"))
    fileWriter = new FileWriter(new File(s"src/test/resources/${partitionId}/temp"))
    true
  }

  override def process(value: Row): Unit = {
    fileWriter.append(value.toSeq.mkString(","))
  }

  override def close(errorOrNull: Throwable): Unit = {
    fileWriter.close()
  }
}
val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreach(writerForText)
  .start()
Append output mode is required for any of the FileFormat sinks, incl. json, which Spark Structured Streaming validates before starting your streaming query.
if (outputMode != OutputMode.Append) {
  throw new AnalysisException(
    s"Data source $className does not support $outputMode output mode")
}
In Spark 2.4, you could use the DataStreamWriter.foreach operator or the brand-new DataStreamWriter.foreachBatch operator, which simply accepts a function that takes the Dataset of a batch and the batch ID.
foreachBatch(function: (Dataset[T], Long) => Unit): DataStreamWriter[T]
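For example, with foreachBatch you can keep the update output mode and still use the ordinary batch JSON writer for every micro-batch. A sketch, assuming sessionUpdates is a DataFrame (adjust the element type if it is a typed Dataset) and using a hypothetical per-batch directory layout:

val query = sessionUpdates
  .writeStream
  .outputMode("update")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is written with the batch JSON writer,
    // so the streaming file-sink restriction to Append mode no longer applies.
    batchDF.write
      .mode("append")
      .json(s"/work/output/data/batch_$batchId") // hypothetical layout: one directory per batch
  }
  .option("checkpointLocation", "/work/output/checkpoint")
  .start()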

Structured Spark Streaming multiple writes

I am writing a data stream to both a Kafka topic and HBase.
For Kafka, I use a format like this:
dataset.selectExpr("id as key", "to_json(struct(*)) as value")
.writeStream.format("kafka")
.option("kafka.bootstrap.servers", Settings.KAFKA_URL)
.option("topic", Settings.KAFKA_TOPIC2)
.option("checkpointLocation", "/usr/local/Cellar/zookeepertmp")
.outputMode(OutputMode.Complete())
.start()
and then for HBase, I do something like this:
dataset.writeStream.outputMode(OutputMode.Complete())
  .foreach(new ForeachWriter[Row] {
    override def process(r: Row): Unit = {
      // my logic
    }
    override def close(errorOrNull: Throwable): Unit = {}
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }
  }).start().awaitTermination()
This writes to HBase as expected but doesn't always write to the Kafka topic. I am not sure why that is happening.
Use foreachBatch in spark:
If you want to write the output of a streaming query to multiple locations, then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations, you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  batchDF.persist()
  batchDF.write.format(…).save(…) // location 1
  batchDF.write.format(…).save(…) // location 2
  batchDF.unpersist()
}
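Applied to this question's setup, a hedged sketch (Spark 2.4+), assuming dataset is a DataFrame; the output mode follows the question, and writeToHBase stands in for whatever per-row HBase logic you already have in the ForeachWriter:

dataset.writeStream
  .outputMode(OutputMode.Complete())
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()

    // location 1: Kafka, via the ordinary batch writer
    batchDF.selectExpr("id as key", "to_json(struct(*)) as value")
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", Settings.KAFKA_URL)
      .option("topic", Settings.KAFKA_TOPIC2)
      .save()

    // location 2: HBase, reusing the row-level logic from the ForeachWriter above
    batchDF.foreachPartition { rows: Iterator[Row] =>
      rows.foreach(r => writeToHBase(r)) // writeToHBase is a hypothetical helper
    }

    batchDF.unpersist()
  }
  .option("checkpointLocation", "/usr/local/Cellar/zookeepertmp") // path from the question
  .start()
  .awaitTermination()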