Reading Avro messages from Kafka in Spark Streaming / Structured Streaming - PySpark

I am using pyspark for the first time.
Spark Version : 2.3.0
Kafka Version : 2.2.0
I have a Kafka producer that sends nested data in Avro format, and I am trying to write Spark Streaming / Structured Streaming code in PySpark that will deserialize the Avro coming from Kafka into a DataFrame, do transformations, and write it in Parquet format to S3.
I was able to find Avro converters in Spark/Scala, but support in PySpark has not yet been added. How do I do the same in PySpark?
Thanks.

As you mentioned, there are no direct PySpark libraries for reading and parsing Avro messages from Kafka. However, you can read and parse the Avro messages by writing a small wrapper around the JVM from_avro function and applying it in your PySpark streaming code, as below.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Note: Avro is a built-in but external data source module since Spark 2.4. Please deploy the application as per the deployment section of the "Apache Avro Data Source Guide".
Reference: https://spark-test.github.io/pyspark-coverage-site/pyspark_sql_avro_functions_py.html
Spark-Submit :
[adjust the package versions to match your Spark/Avro installation]
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
Pyspark Streaming Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.streaming import StreamingContext
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.functions import udf
import json
import csv
import time
import os
# Spark Streaming context :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
ssc = StreamingContext(sc, 20)  # not actually used by the Structured Streaming query below
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
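# Note: the kafka.* security options below (SASL_SSL, Kerberos, truststore) are specific
# to a secured cluster; for an unsecured local broker they can be omitted.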
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
# Deserializing the Avro value: a thin wrapper around the JVM from_avro function
# (pyspark.sql.avro.functions is not available before Spark 3.0)
from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def from_avro(col):
    # Avro schema of the Kafka message value
    jsonFormatSchema = """
    {
        "type": "record",
        "name": "struct",
        "fields": [
            {"name": "col1", "type": "long"},
            {"name": "col2", "type": "string"}
        ]
    }"""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

# from_avro returns a Column, so apply it directly to the binary Kafka value
# (do not cast the value to string first, and do not wrap the function in udf())
df2 = df.select(from_avro("value").alias("data")).select("data.*")
# Writing the deserialized stream to Parquet :
query = df2.coalesce(1).writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/path/chk31") \
    .outputMode("append") \
    .start("/path/stream/tgt31")
query.awaitTermination()
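If you can run Spark 3.x instead, the JVM wrapper above is not needed: from_avro is exposed directly in pyspark.sql.avro.functions (you still need the matching spark-avro package on the classpath). A minimal sketch under that assumption, reusing the schema and paths from above:
# Spark 3.x sketch: native PySpark from_avro, same schema and paths as above.
from pyspark.sql.avro.functions import from_avro

jsonFormatSchema = """
{
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}"""

parsed = df.select(from_avro("value", jsonFormatSchema).alias("data")).select("data.*")

query = parsed.coalesce(1).writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/path/chk31") \
    .outputMode("append") \
    .start("/path/stream/tgt31")
query.awaitTermination()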

Related

Databricks - Delta Live Table Pipeline - Ingest Kafka Avro using Schema Registry

I'm new to Azure Databricks and I'm trying to implement an Azure Databricks Delta Live Table pipeline that ingests from a Kafka topic containing messages where the values are Schema Registry-encoded Avro.
Work done so far...
Exercise to Consume and Write to a Delta Table
Using the Confluent example, I've read the "raw" message via:
rawAvroDf = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", confluentBootstrapServers)
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))
.option("kafka.ssl.endpoint.identification.algorithm", "https")
.option("kafka.sasl.mechanism", "PLAIN")
.option("subscribe", confluentTopicName)
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.load()
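# Confluent wire format: position 1 of the value is a magic byte, positions 2-5 hold the
# schema ID, and the Avro payload starts at position 6 (1-based, as used by the substring
# expressions below). binary_to_string is a small UDF (defined in the DLT snippet further
# down) that turns the 4-byte schema ID into its integer value as a string.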
.withColumn('key', fn.col("key").cast(StringType()))
.withColumn('fixedValue', fn.expr("substring(value, 6, length(value)-5)"))
.withColumn('valueSchemaId', binary_to_string(fn.expr("substring(value, 2, 4)")))
.select('topic', 'partition', 'offset', 'timestamp', 'timestampType', 'key', 'valueSchemaId','fixedValue')
)
Created a SchemaRegistryClient:
from confluent_kafka.schema_registry import SchemaRegistryClient
import ssl
schema_registry_conf = {
    'url': schemaRegistryUrl,
    'basic.auth.user.info': '{}:{}'.format(confluentRegistryApiKey, confluentRegistrySecret)
}
schema_registry_client = SchemaRegistryClient(schema_registry_conf)
Defined a deserialization function that looks up the schema ID from the start of the binary message:
import pyspark.sql.functions as fn
from pyspark.sql.avro.functions import from_avro
def parseAvroDataWithSchemaId(df, epoch_id):
    cachedDf = df.cache()
    fromAvroOptions = {"mode": "FAILFAST"}
    def getSchema(id):
        return str(schema_registry_client.get_schema(id).schema_str)
    distinctValueSchemaIdDF = cachedDf.select(fn.col('valueSchemaId').cast('integer')).distinct()
    for valueRow in distinctValueSchemaIdDF.collect():
        currentValueSchemaId = sc.broadcast(valueRow.valueSchemaId)
        currentValueSchema = sc.broadcast(getSchema(currentValueSchemaId.value))
        filterValueDF = cachedDf.filter(fn.col('valueSchemaId') == currentValueSchemaId.value)
        filterValueDF \
            .select('topic', 'partition', 'offset', 'timestamp', 'timestampType', 'key',
                    from_avro('fixedValue', currentValueSchema.value, fromAvroOptions).alias('parsedValue')) \
            .write \
            .format("delta") \
            .mode("append") \
            .option("mergeSchema", "true") \
            .save(deltaTablePath)
Finally written to a delta table:
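# foreachBatch passes each micro-batch to parseAvroDataWithSchemaId as a plain batch
# DataFrame, which is what allows the per-schema-ID from_avro calls defined above.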
rawAvroDf.writeStream \
.option("checkpointLocation", checkpointPath) \
.foreachBatch(parseAvroDataWithSchemaId) \
.queryName("clickStreamTestFromConfluent") \
.start()
Created a (Bronze/Landing) Delta Live Table
import dlt
import pyspark.sql.functions as fn
from pyspark.sql.types import StringType
@dlt.table(
    name = "<<landingTable>>",
    path = "<<storage path>>",
    comment = "<< descriptive comment>>"
)
def landingTable():
    jasConfig = "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)
    binary_to_string = fn.udf(lambda x: str(int.from_bytes(x, byteorder='big')), StringType())
    kafkaOptions = {
        "kafka.bootstrap.servers": confluentBootstrapServers,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.jaas.config": jasConfig,
        "kafka.ssl.endpoint.identification.algorithm": "https",
        "kafka.sasl.mechanism": "PLAIN",
        "subscribe": confluentTopicName,
        "startingOffsets": "earliest",
        "failOnDataLoss": "false"
    }
    return (
        spark
        .readStream
        .format("kafka")
        .options(**kafkaOptions)
        .load()
        .withColumn('key', fn.col("key").cast(StringType()))
        .withColumn('valueSchemaId', binary_to_string(fn.expr("substring(value, 2, 4)")))
        .withColumn('avroValue', fn.expr("substring(value, 6, length(value)-5)"))
        .select(
            'topic',
            'partition',
            'offset',
            'timestamp',
            'timestampType',
            'key',
            'valueSchemaId',
            'avroValue'
        )
    )
Help Required on:
Ensure that the landing table is a STREAMING LIVE TABLE
Deserialize the Avro-encoded message value (a STREAMING LIVE VIEW calling a Python UDF?)

Kafka stream is not listening on input when created through SparkSession builder

I am trying to create a Kafka consumer that uses the MongoDB Spark Connector in the same program: something like Kafka input as an RDD, converted to a DataFrame, and then stored in MongoDB for later use.
My Producer is up and running and the "standard" consumer looks like this and gets the messages nicely:
# Spark
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 30)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming-consumer', {'trump':1})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.pprint()
ssc.start()
ssc.awaitTermination()
The "modified" consumer I want to use, which is built through SparkSessionBuilder with the config options to use mongodb looks like this:
#Additional for Session Building and Preprocessing
from pyspark import SparkContext
from pyspark import SQLContext
from pyspark.sql import SparkSession
import collections
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
# Build the SparkSession
spark = SparkSession.builder \
.master("local") \
.appName("TrumpTweets") \
.config("spark.executor.memory", "1gb") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/trumptweets.tweets") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/trumptweets.tweets") \
.getOrCreate()
ssc = StreamingContext(spark.sparkContext, 30)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming-consumer', {'trump':1})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.pprint()
ssc.start()
ssc.awaitTermination()
It runs nicely but does not receive any messages. I don't see anything different other than the session builder, and it doesn't produce any error messages.
Please help me out, I'm really stuck on this one... Any other way is also appreciated!

How do I convert a dataframe to JSON and write to kafka topic with key

I'm trying to write a DataFrame to Kafka in JSON format and add a key to the DataFrame in Scala. I'm currently working with this sample from the Spark-Kafka integration docs:
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
Is there a to_json method that can be used (instead of the json(path) option, which I believe writes out to a file in JSON format), and is there a key option that can be used to replace the null value with an actual key?
This is a minimal example in Scala. Let's say you have a DataFrame df with columns x and y:
val dataDS = df
.select(
$"x".cast(StringType),
$"y".cast(StringType)
)
.toJSON
.withColumn("key", lit("keyname"))
dataDS
.write
.format("kafka")
.option("kafka.bootstrap.servers", "servername:port")
.option("topic", "topicname")
.save()
Remember you need the spark-sql-kafka library; for spark-shell, for example, it is loaded with:
spark-shell --packages "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1"
You can make use of the to_json SQL function to convert your columns into JSON.
See the Scala code below, which also makes use of the Spark SQL built-in function struct (tested with Spark 2.4.5). Just make sure that you are naming your columns key and value, or apply corresponding aliases in your selectExpr.
import org.apache.spark.sql.functions.{col, struct, to_json}
import org.apache.spark.sql.SparkSession
object Main extends App {
val spark = SparkSession.builder()
.appName("myAppName")
.master("local[*]")
.getOrCreate()
// create DataFrame
import spark.implicits._
val df = Seq((3, "Alice"), (5, "Bob")).toDF("age", "name")
df.show(false)
// convert columns into json string
val df2 = df.select(col("name"),to_json(struct($"*"))).toDF("key", "value")
df2.show(false)
// +-----+------------------------+
// |key |value |
// +-----+------------------------+
// |Alice|{"age":3,"name":"Alice"}|
// |Bob |{"age":5,"name":"Bob"} |
// +-----+------------------------+
// write to Kafka with jsonString as value
df2.selectExpr("key", "value")
.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "test-topic")
.save()
}
This will return the following data into your Kafka topic:
kafka-console-consumer --bootstrap-server localhost:9092 --property print.key=true --property print.value=true --topic test-topic
Alice {"age":3,"name":"Alice"}
Bob {"age":5,"name":"Bob"}
You can use the toJSON() method on a DataFrame to convert your records to JSON messages.
df = spark.createDataFrame([('user_first_name','user_last_nmae',100)], ['first_name','last_name','ID'])
import json
from datetime import datetime
from pyspark.sql.functions import lit
json_df = json.loads(df.withColumn('date_as_key', lit(datetime.now().date())).toJSON().first())
print json_df
{u'date_as_key': u'2019-07-29', u'first_name': u'user_first_name', u'last_name': u'user_last_nmae', u'ID': 100}
Your code may look like this:
from pyspark.sql.functions import lit
df.withColumn('key', lit(datetime.now())).toJSON()
.write
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "topic1")
.save()
Scala:
import org.apache.spark.sql.functions.lit
someDF.withColumn("key", lit("name")).show() // replace "name" with your variable
someDF.withColumn("key", lit("name")).toJSON.first() // toJSON is available as a method on Dataset/DataFrame in Scala
someDF.withColumn("key",lit("name")).toJSON.first()
res5: String = {"number":8,"word":"bat","key":"name"}
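Note that in PySpark, toJSON() returns an RDD of JSON strings (unlike the Scala Dataset above), so the Python snippet earlier is schematic. A sketch that stays in the DataFrame API using to_json and struct, so the key remains a separate column (broker address and topic name are placeholders):
# Sketch (PySpark): build explicit key/value columns with to_json + struct.
from pyspark.sql.functions import col, lit, struct, to_json

df.select(
        lit("keyname").alias("key"),   # replace with your key expression
        to_json(struct([col(c) for c in df.columns])).alias("value")) \
    .write \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("topic", "topic1") \
    .save()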

Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing

I am trying to read data from Kafka and load it into a Greenplum database using Spark. I am using the greenplum-spark connector, but I am getting "Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing".
Does the Greenplum source not support streaming data? The website says "Continuous ETL pipeline (streaming)".
I have tried passing both "greenplum" and "io.pivotal.greenplum.spark.GreenplumRelationProvider" as the data source in .format(...).
val EventStream = spark.readStream
.format("kafka")
.option("kafka.bootstrap.servers", args(0))
.option("subscribe", args(1))
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.load
val gscWriteOptionMap = Map(
"url" -> "link for greenplum",
"user" -> "****",
"password" -> "****",
"dbschema" -> "dbname"
)
val stateEventDS = EventStream
.selectExpr("CAST(key AS String)", "*****(value)")
.as[(String, ******)]
.map(_._2)
val EventOutputStream = stateEventDS.writeStream
.format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
.options(gscWriteOptionMap)
.start()
EventOutputStream.awaitTermination()
What version of GPDB / Spark are you using?
You could bypass spark in favor of the Greenplum-Kafka connector.
https://gpdb.docs.pivotal.io/5170/greenplum-kafka/overview.html
In earlier versions, the Greenplum-Spark Connector exposed a Spark data source named io.pivotal.greenplum.spark.GreenplumRelationProvider to read data from Greenplum Database into a Spark DataFrame.
In later versions, the connector exposes a Spark data source named greenplum to transfer data between Spark and Greenplum Database.
Should be something like --
val EventOutputStream = stateEventDS.write.format("greenplum")
.options(gscWriteOptionMap)
.save()
See: https://greenplum-spark.docs.pivotal.io/160/write_to_gpdb.html
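Since the greenplum data source only supports batch writes, one way to use it from Structured Streaming (Spark 2.4+, where foreachBatch is available) is to hand each micro-batch to the batch writer. A PySpark sketch under that assumption, where stateEventDS and gscWriteOptionMap stand for the parsed stream and the option map from the question:
# Sketch: write each micro-batch to Greenplum via the batch "greenplum" source.
# gscWriteOptionMap is assumed to hold the url/user/password/dbschema options from above.
def write_to_greenplum(batch_df, batch_id):
    batch_df.write \
        .format("greenplum") \
        .options(**gscWriteOptionMap) \
        .mode("append") \
        .save()

stateEventDS.writeStream \
    .foreachBatch(write_to_greenplum) \
    .option("checkpointLocation", "/tmp/gpdb-checkpoint") \
    .start() \
    .awaitTermination()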
Greenplum Spark Structured Streaming
Demonstrates how to use the writeStream API with GPDB using JDBC.
The following code block reads from a rate stream source and uses a JDBC-based sink to stream batches into GPDB.
Batch based streaming
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import scala.concurrent.duration._
val sq = spark.
readStream.
format("rate").
load.
writeStream.
format("myjdbc").
option("checkpointLocation", "/tmp/jdbc-checkpoint").
trigger(Trigger.ProcessingTime(10.seconds)).
start
Record based streaming
This uses the ForeachWriter
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import scala.concurrent.duration._
val url="jdbc:postgresql://gsc-dev:5432/gpadmin"
val user ="gpadmin"
val pwd = "changeme"
val jdbcWriter = new JDBCSink(url, user, pwd) // JDBCSink is a custom ForeachWriter implementation (not shown here)
val sq = spark.
readStream.
format("rate").
load.
writeStream.
foreach(jdbcWriter).
option("checkpointLocation", "/tmp/jdbc-checkpoint").
trigger(Trigger.ProcessingTime(10.seconds)).
start

Structured Streaming reading from a Kafka topic

I have read a CSV file, converted the value field into bytes, and written it to a Kafka topic using a Kafka producer application. Now I am trying to read from the Kafka topic using Structured Streaming, but I am not able to apply custom Kryo deserialization on the value field.
Can anyone tell me how to use custom deserialization in Structured Streaming?
I had a similar problem: basically, all my Kafka messages were in Protobuf, and I solved it with a UDF.
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StringType, TimestampType

def deserialization_function(message):
    # You need to add your code to deserialize your messages here.
    # I returned a dict matching the schema below, but you can return another structure.
    json = {"x": x_deserializable,
            "y": y_deserializable,
            "w": w_deserializable,
            "z": z_deserializable}
    return json

schema = StructType() \
    .add("x", TimestampType()) \
    .add("y", StringType()) \
    .add("z", StringType()) \
    .add("w", StringType())

own_udf = udf(deserialization_function, schema)

stream = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", topic) \
    .load()

query = stream \
    .select(col("value")) \
    .select((own_udf("value")).alias("value_udf")) \
    .select("value_udf.x", "value_udf.y", "value_udf.w", "value_udf.z")