Structured Streaming reading from a Kafka topic - apache-kafka

I have read a CSV file, converted the value field into bytes, and written it to a Kafka topic using a Kafka producer application. Now I am trying to read from the Kafka topic using Structured Streaming, but I am not able to apply custom Kryo deserialization on the value field.
Can anyone tell me how to use custom deserialization in Structured Streaming?

I had a similar problem: all of my Kafka messages were in Protobuf, and I solved it with a UDF.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StringType, TimestampType

def deserialization_function(message):
    # Add your code here to deserialize your messages.
    # I returned a dict matching the schema below, but you can return another structure.
    json = {"x": x_deserializable,
            "y": y_deserializable,
            "w": w_deserializable,
            "z": z_deserializable}
    return json

schema = StructType() \
    .add("x", TimestampType()) \
    .add("y", StringType()) \
    .add("z", StringType()) \
    .add("w", StringType())

own_udf = udf(deserialization_function, schema)

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", topic) \
    .load()

query = stream \
    .select(col("value")) \
    .select(own_udf("value").alias("value_udf")) \
    .select("value_udf.x", "value_udf.y", "value_udf.w", "value_udf.z")

Related

Spark read stream from kafka using delta tables

I'm trying to read a stream from a Kafka topic using Spark Structured Streaming / Python. I can read the messages and dump them to a bronze table with the default Kafka message schema, but I cannot cast the key and value from binary to string. I've tried the following approaches, and none of them worked:
approach 1:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load()).selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events
error:
Failed to merge fields 'key' and 'key'. Failed to merge incompatible data types BinaryType and StringType
approach 2:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load())

raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events
There was no error message, but when I later checked the kafka_bronze table, the key and value columns were still in binary format.
approach 3: added a kafka_silver table:
raw_kafka_events = (spark.readStream
    .format("kafka")
    .option("subscribe", TOPIC)
    .option("kafka.bootstrap.servers", KAFKA_BROKER)
    .option("startingOffsets", "earliest")
    .option("kafka.security.protocol", "SSL")
    .option("kafka.ssl.truststore.location", SSL_TRUST_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.location", SSL_KEY_STORE_FILE_LOCATION)
    .option("kafka.ssl.keystore.password", SSL_KEY_STORE_PASSWORD)
    .option("kafka.ssl.truststore.password", SSL_TRUST_STORE_PASSWORD)
    .option("kafka.ssl.key.password", SSL_KEY_PASSWORD)
    .option("kafka.ssl.keystore.type", "JKS")
    .option("kafka.ssl.truststore.type", "JKS")
    .option("failOnDataLoss", "false")
    .load())

@dlt.table(
    comment="the raw message from the kafka topic",
    table_properties={"pipelines.reset.allowed": "false"}
)
def kafka_bronze():
    return raw_kafka_events

@dlt.table(comment="real schema for kafka payload",
           temporary=False)
def kafka_silver():
    return (
        # kafka streams are (timestamp, value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
        .select(col("key").cast("string"))
        .select(col("value").cast("string"))
    )
error:
Column 'value' does not exist.
How can I cast the key/value to string after reading them from the Kafka topic? I'd prefer to dump the string-valued key/value to the bronze table, but if that's impossible I can dump them to the silver table instead.
First, it's recommended to define the raw_kafka_events variable inside the function, so it is local to that function.
In the second approach, your problem is that you just do raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") without assigning the result back to the variable, like this: raw_kafka_events = raw_kafka_events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").
The second problem is that when you use expressions like CAST(key AS STRING), the fields get a new name matching that expression. Change them to CAST(key AS STRING) as key and CAST(value AS STRING) as value - this should fix the first problem.
In the third approach, you have select statements chained:
def kafka_silver():
    return (
        # kafka streams are (timestamp, value)
        # value contains the kafka payload
        dlt.read_stream("kafka_bronze")
        .select(col("key").cast("string"))
        .select(col("value").cast("string"))
    )
but after the first select you will get a dataframe with only one column - key. You need to change the code to:
dlt.read_stream("kafka_bronze") \
    .select(col("key").cast("string").alias("key"),
            col("value").cast("string").alias("value"))

Databricks - Delta Live Table Pipeline - Ingest Kafka Avro using Schema Registry

I'm new to Azure Databricks and I'm trying to implement an Azure Databricks Delta Live Tables pipeline that ingests from a Kafka topic containing messages whose values are Schema Registry encoded Avro.
Work done so far...
Exercise to Consume and Write to a Delta Table
Using the Confluent example, I've read the "raw" messages via:
rawAvroDf = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", confluentBootstrapServers)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
    .option("kafka.ssl.endpoint.identification.algorithm", "https")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option("subscribe", confluentTopicName)
    .option("startingOffsets", "earliest")
    .option("failOnDataLoss", "false")
    .load()
    .withColumn('key', fn.col("key").cast(StringType()))
    .withColumn('fixedValue', fn.expr("substring(value, 6, length(value)-5)"))
    .withColumn('valueSchemaId', binary_to_string(fn.expr("substring(value, 2, 4)")))
    .select('topic', 'partition', 'offset', 'timestamp', 'timestampType', 'key', 'valueSchemaId', 'fixedValue')
)
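For reference, the binary_to_string helper used above is the same small UDF that appears in the DLT version further down; it turns the 4-byte schema-id prefix of the Confluent wire format into a decimal string:

import pyspark.sql.functions as fn
from pyspark.sql.types import StringType

# Convert the big-endian 4-byte schema id into its decimal string representation.
binary_to_string = fn.udf(lambda x: str(int.from_bytes(x, byteorder='big')), StringType())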
Created a SchemaRegistryClient:
from confluent_kafka.schema_registry import SchemaRegistryClient
import ssl

schema_registry_conf = {
    'url': schemaRegistryUrl,
    'basic.auth.user.info': '{}:{}'.format(confluentRegistryApiKey, confluentRegistrySecret)}

schema_registry_client = SchemaRegistryClient(schema_registry_conf)
Defined a deserialization function that looks up the schema ID from the start of the binary message:
import pyspark.sql.functions as fn
from pyspark.sql.avro.functions import from_avro

def parseAvroDataWithSchemaId(df, epoch_id):
    cachedDf = df.cache()
    fromAvroOptions = {"mode": "FAILFAST"}

    def getSchema(id):
        return str(schema_registry_client.get_schema(id).schema_str)

    distinctValueSchemaIdDF = cachedDf.select(fn.col('valueSchemaId').cast('integer')).distinct()
    for valueRow in distinctValueSchemaIdDF.collect():
        currentValueSchemaId = sc.broadcast(valueRow.valueSchemaId)
        currentValueSchema = sc.broadcast(getSchema(currentValueSchemaId.value))
        filterValueDF = cachedDf.filter(fn.col('valueSchemaId') == currentValueSchemaId.value)
        filterValueDF \
            .select('topic', 'partition', 'offset', 'timestamp', 'timestampType', 'key',
                    from_avro('fixedValue', currentValueSchema.value, fromAvroOptions).alias('parsedValue')) \
            .write \
            .format("delta") \
            .mode("append") \
            .option("mergeSchema", "true") \
            .save(deltaTablePath)
Finally, I wrote the stream to a Delta table:
rawAvroDf.writeStream \
    .option("checkpointLocation", checkpointPath) \
    .foreachBatch(parseAvroDataWithSchemaId) \
    .queryName("clickStreamTestFromConfluent") \
    .start()
Created a (Bronze/Landing) Delta Live Table
import dlt
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType

@dlt.table(
    name="<<landingTable>>",
    path="<<storage path>>",
    comment="<< descriptive comment>>"
)
def landingTable():
    jasConfig = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)
    binary_to_string = fn.udf(lambda x: str(int.from_bytes(x, byteorder='big')), StringType())
    kafkaOptions = {
        "kafka.bootstrap.servers": confluentBootstrapServers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.jaas.config": jasConfig,
        "kafka.ssl.endpoint.identification.algorithm": "https",
        "kafka.sasl.mechanism": "PLAIN",
        "subscribe": confluentTopicName,
        "startingOffsets": "earliest",
        "failOnDataLoss": "false"
    }
    return (
        spark
        .readStream
        .format("kafka")
        .options(**kafkaOptions)
        .load()
        .withColumn('key', fn.col("key").cast(StringType()))
        .withColumn('valueSchemaId', binary_to_string(fn.expr("substring(value, 2, 4)")))
        .withColumn('avroValue', fn.expr("substring(value, 6, length(value)-5)"))
        .select(
            'topic',
            'partition',
            'offset',
            'timestamp',
            'timestampType',
            'key',
            'valueSchemaId',
            'avroValue'
        )
    )
Help Required on:
Ensure that the landing table is a STREAMING LIVE TABLE
Deserialize the Avro-encoded message value (a STREAMING LIVE VIEW calling a Python UDF? See the sketch below.)
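For the second point, one possible direction, sketched here only under the assumption of a single known value schema (fetched beforehand into valueSchemaStr via the SchemaRegistryClient; the per-schema-id lookup from the foreachBatch example would need extra work inside DLT):

import dlt
from pyspark.sql.avro.functions import from_avro

@dlt.view(comment="parsed avro payload from the landing table")
def parsedView():
    # Assumes valueSchemaStr holds the Avro schema JSON fetched from the Schema Registry.
    return (
        dlt.read_stream("<<landingTable>>")
        .select('key', from_avro('avroValue', valueSchemaStr).alias('parsedValue'))
    )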

Reading avro messages from Kafka in spark streaming/structured streaming

I am using pyspark for the first time.
Spark Version : 2.3.0
Kafka Version : 2.2.0
I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in PySpark that will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3.
I was able to find Avro converters in Spark/Scala, but support in PySpark has not yet been added. How do I do the same in PySpark?
Thanks.
As you mentioned, there are no direct libraries for reading and parsing Avro messages from Kafka in PySpark. But we can read/parse the Avro messages by writing a small wrapper and calling that function in your PySpark streaming code, as below.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Note: Avro support is built into Spark SQL since Spark 2.4, but spark-avro is an external module. Please deploy the application as per the deployment section of the "Apache Avro Data Source Guide".
Reference: https://spark-test.github.io/pyspark-coverage-site/pyspark_sql_avro_functions_py.html
Spark-Submit :
[adjust the package versions to match your Spark/Avro installation]
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
Pyspark Streaming Code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.column import Column, _to_java_column

# Spark session :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'

# Creating the readStream DataFrame :
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
    .option("subscribe", KAFKA_TOPIC_NAME_CONS) \
    .option("startingOffsets", "latest") \
    .option("failOnDataLoss", "false") \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.client.id", "MCI-CIL") \
    .option("kafka.sasl.kerberos.service.name", "kafka") \
    .option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
    .option("kafka.ssl.truststore.password", "changeit") \
    .option("kafka.sasl.kerberos.keytab", "/path/bdpda.headless.keytab") \
    .option("kafka.sasl.kerberos.principal", "bdpda") \
    .load()

# Keep the Avro payload as binary; from_avro below expects the raw bytes.
df1 = df.select("value")
df1.createOrReplaceTempView("test")
# Deserializing the Avro value via a thin wrapper around the JVM from_avro function
def from_avro(col):
    # Avro schema of the message value (adjust to your own schema)
    jsonFormatSchema = """
    {
        "type": "record",
        "name": "struct",
        "fields": [
            {"name": "col1", "type": "long"},
            {"name": "col2", "type": "string"}
        ]
    }"""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

# from_avro returns a Column expression, so apply it directly in a select;
# it does not need to be registered or wrapped as a Python UDF.
df1 = spark.table("test")
df2 = df1.select(from_avro("value").alias("parsedValue"))
# Writing the parsed stream out as parquet :
query = df2.coalesce(1).writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/path/chk31") \
    .outputMode("append") \
    .start("/path/stream/tgt31")

query.awaitTermination()
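On newer Spark versions (3.0+, including recent Databricks runtimes) you can skip the JVM wrapper and use the built-in pyspark.sql.avro.functions.from_avro instead; the spark-avro package still has to be on the classpath. A minimal sketch, assuming the same df and schema as above:

from pyspark.sql.avro.functions import from_avro

jsonFormatSchema = """
{
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}"""

# Parse the binary Kafka value directly with the built-in Avro function.
parsed = df.select(from_avro("value", jsonFormatSchema).alias("parsedValue")) \
           .select("parsedValue.*")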

pyspark kafka streaming data handler

I'm using Spark 2.3.2 with PySpark and just figured out that foreach and foreachBatch are not available on the DataStreamWriter object in this configuration. The problem is that the company's Hadoop is 2.6, and Spark 2.4 (which provides what I need) doesn't work there (SparkSession crashes). Is there another alternative to send data to a custom handler and process the streaming data?
This is my code until now:
def streamLoad(self, customHandler):
    options = self.options
    self.logger.info("Retrieving the schema based on the JSON structure")
    jsonStrings = ['{"sku":"9","ean":"4","name":"DVD","description":"foo description","categories":[{"code":"M02_BLURAY_E_DVD_PLAYER"}],"attributes":[{"name":"attrTeste","value":"Teste"}]}']
    myRDD = self.spark.sparkContext.parallelize(jsonStrings)
    jsonSchema = self.spark.read.json(myRDD).schema  # Maybe there is a way to serialize this
    self.logger.info("Starting the Kafka streaming [options: {}]".format(str(options)))
    df = self.spark \
        .readStream \
        .format("kafka") \
        .option("maxFilesPerTrigger", 1) \
        .option("kafka.bootstrap.servers", options["kafka.bootstrap.servers"]) \
        .option("startingOffsets", options["startingOffsets"]) \
        .option("subscribe", options["subscribe"]) \
        .option("failOnDataLoss", options["failOnDataLoss"]) \
        .load() \
        .select(
            col('value').cast("string").alias('json'),
            col('key').cast("string").alias('kafka_key'),
            col("timestamp").cast("string").alias('kafka_timestamp')
        ) \
        .withColumn('pjson', from_json(col('json'), jsonSchema)).drop('json')
    # .foreach() doesn't work in Spark 2.3.x. Alternatives, please?
    query = df \
        .writeStream \
        .foreach(customHandler) \
        .start()
    query.awaitTermination()

How to use from_json with Kafka connect 0.10 and Spark Structured Streaming?

I was trying to reproduce the example from Databricks and apply it to the new Kafka connector and Spark Structured Streaming; however, I cannot parse the JSON correctly using the out-of-the-box methods in Spark...
Note: the topic is written into Kafka in JSON format.
val ds1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", IP + ":9092")
  .option("zookeeper.connect", IP + ":2181")
  .option("subscribe", TOPIC)
  .option("startingOffsets", "earliest")
  .option("max.poll.records", 10)
  .option("failOnDataLoss", false)
  .load()
The following code won't work; I believe that's because the column json is a string and does not match the from_json method signature...
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json("json") as "data")
  .select("data.*")
Any tips?
[UPDATE] Example working:
https://github.com/katsou55/kafka-spark-structured-streaming-example/blob/master/src/main/scala-2.11/Main.scala
First you need to define the schema for your JSON message. For example
val schema = new StructType()
  .add($"id".string)
  .add($"name".string)
Now you can use this schema in the from_json method as below.
val df = ds1.select($"value" cast "string" as "json")
  .select(from_json($"json", schema) as "data")
  .select("data.*")