I have an ETL job (Spark/Scala). After writing to a table, a message with a "header" must be sent to Kafka. I have a Spark DataFrame with the "key" and the "value", but I couldn't add the header to the message: when I read the message back, it comes with the header field as "NO HEADERS". How can I include the header in the message?
This is an example of what I have already tried:
val df = spark.createDataFrame(spark.sparkContext.parallelize(mySeq), schemaDf)
.withColumn("headers", split(col("Marca"), "###|%").cast("array<string>"))
.selectExpr("CAST(Marca AS STRING) AS key", "to_json(struct(*)) AS value", "headers AS headers")
.as[(String, String, Array[String])]
df
.write.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "test-kafka-notification")
.option("includeHeaders", "true")
.save()
I have also tried with the column "headers" as String and it didn't work either.
The message I get is this one, with "NO HEADERS":
NO_HEADERS peugeot {"Marca":"peugeot","Modelo":"308","Unidades":3,"headers":["peugeot"]}
NO_HEADERS Seat {"Marca":"Seat","Modelo":"Arona","Unidades":4,"headers":["Seat"]}
NO_HEADERS Seat {"Marca":"Seat","Modelo":"Leon","Unidades":10,"headers":["Seat"]}
NO_HEADERS Seat {"Marca":"Seat","Modelo":"Ibiza","Unidades":6,"headers":["Seat"]}
NO_HEADERS Opel {"Marca":"Opel","Modelo":"Corsa","Unidades":6,"headers":["Opel"]}
NO_HEADERS Fiat {"Marca":"Fiat","Modelo":"Punto","Unidades":16,"headers":["Fiat"]}
NO_HEADERS Fiat {"Marca":"Fiat","Modelo":"Panda","Unidades":2,"headers":["Fiat"]}
NO_HEADERS Mercedes {"Marca":"Mercedes","Modelo":"Benz","Unidades":1,"headers":["Mercedes"]}
Thank you, regards,
The easiest way to add headers is with withColumn, as you tried in your example.
Headers must be an array<struct<key:string,value:binary>>, so make sure your column matches this format. You don't need to add
.option("includeHeaders", "true")
in your df.write; that option is used when you are reading from Kafka into a DataFrame with readStream.
Here is a simple example you can replicate:
import org.apache.spark.sql.functions.{array, lit, struct}

// selectDf is assumed to already contain the key and value columns
selectDf
  .withColumn("headers", array(struct(lit("headerKey").as("key"), lit("headerValue").cast("binary").as("value"))))
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "output")
  .save()
Don't forget to check that your Spark version and Kafka client support headers (header support was added to Spark's Kafka source and sink in Spark 3.0, and Kafka record headers require Kafka 0.11.0 or newer).
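For reference, here is a minimal sketch of how this could look with the DataFrame from the question (reusing mySeq and schemaDf, and assuming Spark 3.0+; the Marca value is also used as the header value just to illustrate). The batch read at the end is only there to verify that the headers arrived:

import org.apache.spark.sql.functions.{array, col, lit, struct}

// Build key/value plus a headers column with the expected array<struct<key,value>> type
val toKafka = spark.createDataFrame(spark.sparkContext.parallelize(mySeq), schemaDf)
  .selectExpr("CAST(Marca AS STRING) AS key", "to_json(struct(*)) AS value")
  .withColumn("headers",
    array(struct(lit("Marca").as("key"), col("key").cast("binary").as("value"))))

toKafka.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "test-kafka-notification")
  .save()

// Read back in batch mode to verify; includeHeaders belongs here, on the read side
spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-kafka-notification")
  .option("includeHeaders", "true")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
  .show(false)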
Hi, I'm reading from a Kafka topic and I want to process the data received from Kafka (tokenization, filtering out unnecessary data, removing stop words) and finally write it back to another Kafka topic.
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
val df = readStream.selectExpr("CAST(value AS STRING)" )
df.show(false)
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result
// write back to kafka
val writeStream = cleanedDataframe
.writeStream
.outputMode("append")
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("topic", "writing.val")
.start()
writeStream.awaitTermination()
Then I am getting the below error
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Queries with streaming sources must be executed with
writeStream.start();;
Then I edited my code as follows to read from Kafka and write to the console:
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)" )
val query = df.writeStream
.outputMode("append")
.format("console")
.start().awaitTermination();
// then perform the data processing part as mentioned in the first half
With the second method, data was continuously displayed in the console, but it never ran through the data processing part. How can I read from a Kafka topic, perform some actions on the received data (tokenization, removing stop words), and finally write it back to a new Kafka topic?
EDIT
The stack trace points at df.show(false) in the code above.
There are two common problems in your current implementation:
Applying show in a streaming context
Code after awaitTermination will not be executed
Regarding 1.
The method show is an action (as opposed to a transformation) on a DataFrame. As you are dealing with streaming DataFrames, calling it causes an error, because streaming queries need to be started with start (just as the exception text is telling you).
Regarding 2.
The method awaitTermination is a blocking call, which means that the code after it will not be executed until the query terminates.
Overall Solution
If you want to read and write to Kafka and in-between want to understand what data is being processed by showing the data in the console you can do the following:
// read from kafka
val readStream = existingSparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("subscribe", "my.raw") // Always read from offset 0, for dev/testing purpose
.load()
// write to console
val df = readStream.selectExpr("CAST(value AS STRING)" )
df.writeStream
.outputMode("append")
.format("console")
.start()
val df_json = df.select(from_json(col("value"), mySchema.defineSchema()).alias("parsed_value"))
val df_text = df_json.withColumn("text", col("parsed_value.payload.Text"))
// perform some data processing actions such as tokenization etc and return cleanedDataframe as the final result
// write back to kafka
// the columns `key` and `value` of the DataFrame `cleanedDataframe` will be used for producing the message into the Kafka topic.
val writeStreamKafka = cleanedDataframe
.writeStream
.outputMode("append")
.format("kafka")
.option("kafka.bootstrap.servers", hostAddress)
.option("topic", "writing.val")
.start()
existingSparkSession.streams.awaitAnyTermination()
Note the existingSparkSession.streams.awaitAnyTermination() at the very end of the code, instead of calling awaitTermination directly after start. Also, remember that the key and value columns of the DataFrame cleanedDataframe will be used for producing the messages to the Kafka topic; a key column, however, is not required.
In addition, if you are using checkpointing (recommended), you need two different checkpoint locations: one for the console stream and one for the Kafka output stream. Keep in mind that the two streaming queries run independently.
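For example, a sketch of the same write-out with separate checkpoint locations (the paths are placeholders):

// One checkpoint location per query
df.writeStream
  .outputMode("append")
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/console-debug")
  .start()

cleanedDataframe.writeStream
  .outputMode("append")
  .format("kafka")
  .option("kafka.bootstrap.servers", hostAddress)
  .option("topic", "writing.val")
  .option("checkpointLocation", "/tmp/checkpoints/kafka-sink")
  .start()

existingSparkSession.streams.awaitAnyTermination()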
I am reading log lines from a Kafka topic through Spark Structured Streaming, separating the fields of the log lines, performing some manipulations on the fields, and storing them in a DataFrame with a separate column for every field. I want to write this DataFrame to Kafka.
Below is my sample DataFrame and the writeStream for writing it to Kafka:
val dfStructuredWrite = dfProcessedLogs.select(
dfProcessedLogs("result").getItem("_1").as("col1"),
dfProcessedLogs("result").getItem("_2").as("col2"),
dfProcessedLogs("result").getItem("_17").as("col3"))
dfStructuredWrite
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.start()
The above code gives me the error below:
Required attribute 'value' not found
I believe this is because I don't have my DataFrame in key/value format. How can I write my existing DataFrame to Kafka in the most efficient way?
The DataFrame being written to Kafka should have the following columns in its schema:
key (optional) (type: string or binary)
value (required) (type: string or binary)
topic (optional) (type: string)
In your case there is no value column, so the exception is thrown.
You have to modify your query to add at least a value column, for example:
import org.apache.spark.sql.functions.{concat, lit}
import spark.implicits._ // for the $"..." column syntax

dfStructuredWrite.select(concat($"col1", lit(" "), $"col2", lit(" "), $"col3").alias("value"))
For more details you can check: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#writing-data-to-kafka
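If you would rather keep the fields structured instead of concatenating them into a space-separated string, a possible alternative (a sketch, assuming the same dfStructuredWrite as above; the checkpoint path is a placeholder) is to serialize each row as JSON:

import org.apache.spark.sql.functions.{col, struct, to_json}

// Serialize the selected columns into a single JSON string used as the Kafka value
dfStructuredWrite
  .select(to_json(struct(col("col1"), col("col2"), col("col3"))).alias("value"))
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .option("checkpointLocation", "/tmp/checkpoints/topic1") // the Kafka sink needs a checkpoint location
  .start()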
I am reading messages from a kafka topic
messageDFRaw = spark.readStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("subscribe", "test-message")\
.load()
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING) as dict")
When I print the data frame from the above query I get the below console output.
|key|dict|
|#badbunny |{"channel": "#badbunny", "username": "mgat22", "message": "cool"}|
How can I create a DataFrame from the DataStreamReader such that I have columns |key|channel|username|message|?
I tried following the accepted answer in How to read records in JSON format from Kafka using Structured Streaming?
from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

struct = StructType([
    StructField("channel", StringType()),
    StructField("username", StringType()),
    StructField("message", StringType()),
])

messageDFRaw.select(from_json("CAST(value AS STRING)", struct))
but I get Expected type 'StructField', got 'StructType' instead in from_json().
I ignored the warning Expected type 'StructField', got 'StructType' instead in from_json().
However, I had to cast the value from the Kafka message to a string first and then parse it with the JSON schema.
messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
messageParsedDF = messageDF.select(from_json("value", struct_schema).alias("message"))
messageFlattenedDF = messageParsedDF.selectExpr("message.channel", "message.username", "message.message")
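Since most of this thread is in Scala, roughly the same fix there would look like the following sketch (the schema fields are taken from the sample message above):

import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

// Schema assumed from the sample message in the question
val structSchema = new StructType()
  .add("channel", StringType)
  .add("username", StringType)
  .add("message", StringType)

val messageDF = messageDFRaw.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
val messageParsedDF = messageDF.select(from_json(col("value"), structSchema).alias("message"))
val messageFlattenedDF = messageParsedDF.selectExpr("message.channel", "message.username", "message.message")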
I am trying out a simple example to publish data to Kafka and consume it using Spark.
Here is the Producer code:
var kafka_input = spark.sql("""
SELECT CAST(Id AS STRING) as key,
to_json(
named_struct(
'Id', Id,
'Title',Title
)
) as value
FROM offer_data""")
kafka_input.write
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("topic", topicName)
.save()
I verified that kafka_input has a JSON string for value and a number cast as a string for key.
Here is the consumer code:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", kafkaBrokers)
.option("subscribe", topicName)
.load()
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
df.take(50)
display(df)
The data I receive on the consumer side is base64 encoded string.
How do I decode the value in Scala?
Also, this read statement is not flushing these records from the Kafka queue. I am assuming this is because I am not sending any ack signal back to Kafka. Is that correct? If so, how do I do that?
Try this:
df.foreach(row => {
val key = row.getAs[Array[Byte]]("key")
val value = row.getAs[Array[Byte]]("value")
println(scala.io.Source.fromBytes(key,"UTF-8").mkString)
println(scala.io.Source.fromBytes(value,"UTF-8").mkString)
})
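As a side note, Source.fromBytes works, but if you just need UTF-8 strings, a plain new String conversion is simpler. In the sketch below, collect() is used so the output shows up on the driver (fine for small test data):

import java.nio.charset.StandardCharsets

// Decode the raw Kafka byte arrays directly on the driver
df.collect().foreach { row =>
  val key = new String(row.getAs[Array[Byte]]("key"), StandardCharsets.UTF_8)
  val value = new String(row.getAs[Array[Byte]]("value"), StandardCharsets.UTF_8)
  println(s"$key -> $value")
}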
The problem was with my usage of selectExpr. It does not do an in-place transformation; it returns the transformed data.
Fix:
df1 = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.as[(String, String)]
display(df1)
I was trying to reproduce the example from Databricks and apply it to the new Kafka connector with Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark...
note: the topic is written into Kafka in JSON format.
val ds1 = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", IP + ":9092")
.option("zookeeper.connect", IP + ":2181")
.option("subscribe", TOPIC)
.option("startingOffsets", "earliest")
.option("max.poll.records", 10)
.option("failOnDataLoss", false)
.load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature...
val df = ds1.select($"value" cast "string" as "json")
.select(from_json("json") as "data")
.select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First, you need to define the schema for your JSON message. For example:
import org.apache.spark.sql.types.StructType
import spark.implicits._ // for the $"..." syntax used below

val schema = new StructType()
  .add($"id".string)
  .add($"name".string)
Now you can use this schema in the from_json method, as shown below:
val df = ds1.select($"value" cast "string" as "json")
.select(from_json($"json", schema) as "data")
.select("data.*")