Pyspark streaming DStreams to Kafka topic

As simple as it gets: is it possible to stream a DStream to a Kafka topic?
I have a Spark Streaming job which does all the data processing; now I want to push the data to a Kafka topic. Is it possible to do so in pyspark?

It is better to convert the data to JSON before writing to Kafka; otherwise, specify the key and value columns that should be written to Kafka.
query = jdf.selectExpr("to_json(struct(*)) AS value")\
.writeStream\
.format("kafka")\
.option("zookeeper.connect", "localhost:2181")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("topic", "test-spark")\
.option("checkpointLocation", "/root/")\
.outputMode("append")\
.start()
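If you want to publish an explicit Kafka key alongside the value, select key and value columns (cast to string or binary) instead of only a JSON value. A minimal sketch, assuming jdf has an id column to use as the key (the column name is an assumption):
query = jdf.selectExpr("CAST(id AS STRING) AS key", "to_json(struct(*)) AS value")\
.writeStream\
.format("kafka")\
.option("kafka.bootstrap.servers", "localhost:9092")\
.option("topic", "test-spark")\
.option("checkpointLocation", "/root/")\
.outputMode("append")\
.start()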

If your messages are in Avro format, you can serialize them and write to Kafka directly.
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils, OffsetRange, TopicAndPartition
import avro.schema
import json
from confluent_kafka.avro import AvroProducer
from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

ssc = StreamingContext(sc, 2)  # `sc` is the already-created SparkContext
topic = "test"
brokers = "localhost:9092"

def handler(message):
    records = message.collect()
    for record in records:
        # <Data processing whatever you want, creating the var_val_value, var_val_key pair>
        # key_schema / value_schema are the Avro schemas registered for the topic;
        # var_bootstrap_servr / var_schema_url point at Kafka and the Schema Registry.
        var_kafka_parms_tgt = {'bootstrap.servers': var_bootstrap_servr,
                               'schema.registry.url': var_schema_url}
        avroProducer = AvroProducer(var_kafka_parms_tgt,
                                    default_key_schema=key_schema,
                                    default_value_schema=value_schema)
        avroProducer.produce(topic=var_topic_tgt_name, value=var_val_value, key=var_val_key)
        avroProducer.flush()

kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
kvs.foreachRDD(handler)
ssc.start()
ssc.awaitTermination()
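One caveat with the handler above: message.collect() pulls every record to the driver and builds a new producer there for every batch. A hedged variant that keeps the work on the executors and creates one producer per partition (same assumed variables and schemas as above):
def push_partition(records):
    # one AvroProducer per partition instead of one per batch on the driver
    producer = AvroProducer(var_kafka_parms_tgt,
                            default_key_schema=key_schema,
                            default_value_schema=value_schema)
    for record in records:
        # <Data processing creating var_val_value, var_val_key from record>
        producer.produce(topic=var_topic_tgt_name, value=var_val_value, key=var_val_key)
    producer.flush()

kvs.foreachRDD(lambda rdd: rdd.foreachPartition(push_partition))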

Related

Kafkastream is not listening on input when created through SparkSession Builder

I am trying to create a Kafka consumer which uses the MongoDB Spark Connector in the same program: something like Kafka input as an RDD, converted to a DataFrame, and then stored in MongoDB for later use.
My producer is up and running, and the "standard" consumer below works and gets the messages nicely:
# Spark
from pyspark import SparkContext
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 30)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming-consumer', {'trump':1})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.pprint()
ssc.start()
ssc.awaitTermination()
The "modified" consumer I want to use, which is built through SparkSessionBuilder with the config options to use mongodb looks like this:
#Additional for Session Building and Preprocessing
from pyspark import SparkContext
from pyspark import SQLContext
from pyspark.sql import SparkSession
import collections
# Spark Streaming
from pyspark.streaming import StreamingContext
# Kafka
from pyspark.streaming.kafka import KafkaUtils
# json parsing
import json
# Build the SparkSession
spark = SparkSession.builder \
.master("local") \
.appName("TrumpTweets") \
.config("spark.executor.memory", "1gb") \
.config("spark.mongodb.input.uri", "mongodb://127.0.0.1/trumptweets.tweets") \
.config("spark.mongodb.output.uri", "mongodb://127.0.0.1/trumptweets.tweets") \
.getOrCreate()
ssc = StreamingContext(spark.sparkContext, 30)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming-consumer', {'trump':1})
parsed = kafkaStream.map(lambda v: json.loads(v[1]))
parsed.pprint()
ssc.start()
ssc.awaitTermination()
It runs nicely but does not receive any messages... I don't see anything different other than the session builder, and it doesn't produce any error messages either.
Please help me out, I'm really stuck on this one... Any other way is also appreciated!

Reading avro messages from Kafka in spark streaming/structured streaming

I am using pyspark for the first time.
Spark Version : 2.3.0
Kafka Version : 2.2.0
I have a Kafka producer which sends nested data in Avro format, and I am trying to write code in Spark Streaming / Structured Streaming in pyspark which will deserialize the Avro coming from Kafka into a dataframe, do transformations, and write it in Parquet format to S3.
I was able to find Avro converters in Spark/Scala, but support in pyspark has not yet been added. How do I do the same in pyspark?
Thanks.
As you mentioned, there are no direct pyspark libraries for reading and parsing Avro messages from Kafka. But we can parse the Avro messages by writing a small wrapper around the JVM from_avro function and calling it from the pyspark streaming code, as below.
Reference :
Pyspark 2.4.0, read avro from kafka with read stream - Python
Note: Avro is a built-in but external data source module since Spark 2.4. Please deploy the application as described in the deployment section of the "Apache Avro Data Source Guide".
Reference: https://spark-test.github.io/pyspark-coverage-site/pyspark_sql_avro_functions_py.html
Spark-Submit :
[adjust the package version to match your Spark/Avro installation]
/usr/hdp/2.6.1.0-129/spark2/bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.3 --conf spark.ui.port=4064
Pyspark Streaming Code:
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col, struct
from pyspark.sql.types import *
import json
# Spark session (Structured Streaming does not need a StreamingContext) :
spark = SparkSession.builder.appName('streamingdata').getOrCreate()
sc = spark.sparkContext
# Kafka Topic Details :
KAFKA_TOPIC_NAME_CONS = "topicname"
KAFKA_OUTPUT_TOPIC_NAME_CONS = "topic_to_hdfs"
KAFKA_BOOTSTRAP_SERVERS_CONS = 'localhost.com:9093'
# Creating readstream DataFrame :
df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", KAFKA_BOOTSTRAP_SERVERS_CONS) \
.option("subscribe", KAFKA_TOPIC_NAME_CONS) \
.option("startingOffsets", "latest") \
.option("failOnDataLoss" ,"false")\
.option("kafka.security.protocol","SASL_SSL")\
.option("kafka.client.id" ,"MCI-CIL")\
.option("kafka.sasl.kerberos.service.name","kafka")\
.option("kafka.ssl.truststore.location", "/path/kafka_trust.jks") \
.option("kafka.ssl.truststore.password", "changeit") \
.option("kafka.sasl.kerberos.keytab","/path/bdpda.headless.keytab") \
.option("kafka.sasl.kerberos.principal","bdpda") \
.load()
df1 = df.selectExpr("CAST(value AS STRING) AS value")

# Deserializing the Avro value with a small wrapper around the JVM from_avro function
def from_avro(col):
    # Avro schema of the incoming messages
    jsonFormatSchema = """
    {
        "type": "record",
        "name": "struct",
        "fields": [
            {"name": "col1", "type": "long"},
            {"name": "col2", "type": "string"}
        ]
    }"""
    sc = SparkContext._active_spark_context
    avro = sc._jvm.org.apache.spark.sql.avro
    f = getattr(getattr(avro, "package$"), "MODULE$").from_avro
    return Column(f(_to_java_column(col), jsonFormatSchema))

# from_avro builds a Column expression, so it can be used directly in select()
df2 = df1.select(from_avro("value"))
# Writing the decoded stream out in Parquet format :
query = df2.coalesce(1).writeStream \
    .format("parquet") \
    .option("checkpointLocation", "/path/chk31") \
    .outputMode("append") \
    .start("/path/stream/tgt31")
query.awaitTermination()
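As a side note, Spark 3.0+ exposes from_avro to Python directly via pyspark.sql.avro.functions (the spark-avro package still has to be supplied with --packages), so the JVM wrapper above is not needed there. A minimal sketch against the same schema:
from pyspark.sql.avro.functions import from_avro

jsonFormatSchema = """
{
    "type": "record",
    "name": "struct",
    "fields": [
        {"name": "col1", "type": "long"},
        {"name": "col2", "type": "string"}
    ]
}"""

# df is the Kafka readStream DataFrame from above; "value" holds the raw Avro bytes
decoded = df.select(from_avro(df.value, jsonFormatSchema).alias("data")).select("data.*")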

Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing

I am trying to read data from Kafka and upload it into a Greenplum database using Spark. I am using the greenplum-spark connector, but I am getting: Data source io.pivotal.greenplum.spark.GreenplumRelationProvider does not support streamed writing.
Is it that the greenplum source does not support streaming data? I can see the website saying "Continuous ETL pipeline (streaming)".
I have tried giving the data source as "greenplum" and "io.pivotal.greenplum.spark.GreenplumRelationProvider" in .format("datasource").
val EventStream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", args(0))
  .option("subscribe", args(1))
  .option("startingOffsets", "earliest")
  .option("failOnDataLoss", "false")
  .load

val gscWriteOptionMap = Map(
  "url" -> "link for greenplum",
  "user" -> "****",
  "password" -> "****",
  "dbschema" -> "dbname"
)

val stateEventDS = EventStream
  .selectExpr("CAST(key AS String)", "*****(value)")
  .as[(String, ******)]
  .map(_._2)

val EventOutputStream = stateEventDS.writeStream
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .options(gscWriteOptionMap)
  .start()

EventOutputStream.awaitTermination()
What version of GPDB / Spark are you using?
You could bypass spark in favor of the Greenplum-Kafka connector.
https://gpdb.docs.pivotal.io/5170/greenplum-kafka/overview.html
In earlier versions, the Greenplum-Spark Connector exposed a Spark data source named io.pivotal.greenplum.spark.GreenplumRelationProvider to read data from Greenplum Database into a Spark DataFrame.
In later versions, the connector exposes a Spark data source named greenplum to transfer data between Spark and Greenplum Database.
Should be something like --
val EventOutputStream = stateEventDS.write.format("greenplum")
.options(gscWriteOptionMap)
.save()
See: https://greenplum-spark.docs.pivotal.io/160/write_to_gpdb.html
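Since the greenplum data source is a batch source, one way to keep the job streaming on Spark 2.4+ is to wrap that batch write in foreachBatch. A rough pyspark sketch (the option names and values below, e.g. dbtable, are illustrative placeholders; check the connector docs for the exact set):
gsc_write_options = {
    "url": "jdbc:postgresql://<greenplum-host>:5432/<db>",   # placeholder
    "user": "****",
    "password": "****",
    "dbschema": "public",
    "dbtable": "events"                                      # assumed target table option
}

def write_to_greenplum(batch_df, batch_id):
    # reuse the connector's batch write path for every micro-batch
    batch_df.write.format("greenplum").options(**gsc_write_options).mode("append").save()

query = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
         .selectExpr("CAST(value AS STRING) AS value")
         .writeStream
         .foreachBatch(write_to_greenplum)
         .option("checkpointLocation", "/tmp/gp-checkpoint")
         .start())

query.awaitTermination()
The same pattern is available from Scala as writeStream.foreachBatch((batchDf, batchId) => ...).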
Greenplum Spark Structured Streaming
Demonstrates how to use the writeStream API with GPDB using JDBC.
The following code block reads from a rate stream source and uses a JDBC-based sink to stream batches into GPDB.
Batch-based streaming:
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

// "rate" generates test rows; "myjdbc" is the custom JDBC-based sink mentioned above
val sq = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  format("myjdbc").
  option("checkpointLocation", "/tmp/jdbc-checkpoint").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start
Record-based streaming
This uses a ForeachWriter:
import org.apache.spark.sql.streaming._
import org.apache.spark.sql.streaming.Trigger
import scala.concurrent.duration._

val url = "jdbc:postgresql://gsc-dev:5432/gpadmin"
val user = "gpadmin"
val pwd = "changeme"

// JDBCSink is a user-defined ForeachWriter[Row] that opens a JDBC connection,
// writes each record, and closes the connection
val jdbcWriter = new JDBCSink(url, user, pwd)

val sq = spark.
  readStream.
  format("rate").
  load.
  writeStream.
  foreach(jdbcWriter).
  option("checkpointLocation", "/tmp/jdbc-checkpoint").
  trigger(Trigger.ProcessingTime(10.seconds)).
  start
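For reference, a hedged pyspark equivalent of that row-level ForeachWriter pattern, using psycopg2 (the table name and connection string are placeholders, not part of the original answer):
import psycopg2

class JdbcForeachWriter:
    # pyspark's writeStream.foreach() accepts an object with open/process/close methods
    def open(self, partition_id, epoch_id):
        self.conn = psycopg2.connect("postgresql://gpadmin:changeme@gsc-dev:5432/gpadmin")
        self.cur = self.conn.cursor()
        return True

    def process(self, row):
        # the "rate" source produces timestamp/value columns
        self.cur.execute("INSERT INTO rate_sink (ts, value) VALUES (%s, %s)",
                         (row.timestamp, row.value))

    def close(self, error):
        self.conn.commit()
        self.conn.close()

query = (spark.readStream.format("rate").load()
         .writeStream
         .foreach(JdbcForeachWriter())
         .option("checkpointLocation", "/tmp/jdbc-checkpoint")
         .start())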

pyspark - kafka streaming integration

I would like to handle empty batches or files when there is no input from the producer. The code below throws an error:
'TransformedDStream' object has no attribute 'isEmpty'
So, how do I need to handle empty files?
Consumer code:
import sys
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql.context import SQLContext

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("Usage: kafka_wordcount.py <zk> <topic>")
        exit(-1)
    sc = SparkContext(appName="PythonStreamingKafkaWordCount")
    ssc = StreamingContext(sc, 60)
    zkQuorum, topic = sys.argv[1:]
    kvs = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 2})
    lines = kvs.map(lambda x: x[1])
    if(not lines.isEmpty()):
        lines.saveAsTextFiles("/user/cloudera/")
    ssc.start()
    ssc.awaitTermination()
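isEmpty() exists on RDDs, not on DStreams, which is why the call above fails. A minimal sketch of doing the check per batch inside foreachRDD instead (the timestamped output path is just an illustration):
import time

def save_rdd(rdd):
    # the RDD handed to foreachRDD does have isEmpty()
    if not rdd.isEmpty():
        # give every non-empty batch its own output directory
        rdd.saveAsTextFile("/user/cloudera/output-%d" % int(time.time()))

lines.foreachRDD(save_rdd)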

Avoid multiple connections to mongoDB from spark streaming

We developed a Spark Streaming application that sources data from Kafka and writes to mongoDB. We are noticing performance implications while creating connections inside foreachRDD on the input DStream. The application does a few validations before inserting into mongoDB. We are exploring options to avoid connecting to mongoDB for each message that is processed; instead, we would like to process all messages within one batch interval at once. Following is a simplified version of the application. One of the things we tried is appending all the messages to a dataframe and inserting the contents of that dataframe outside of the foreachRDD. But when we run this application, the code that writes the dataframe to mongoDB does not get executed.
Please note that I commented out the part of the code inside foreachRDD which we used to insert each message into mongoDB. The existing approach is very slow, as we are inserting one message at a time. Any suggestions on performance improvement are much appreciated.
Thank you
package com.testing
import org.apache.spark.streaming._
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.streaming.kafka._
import org.apache.spark.sql.{ SQLContext, Row, Column, DataFrame }
import java.util.HashMap
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerConfig, ProducerRecord }
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.joda.time._
import org.joda.time.format._
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._
import com.mongodb.util.JSON
import scala.io.Source._
import java.util.Properties
import java.util.Calendar
import scala.collection.immutable
import org.json4s.DefaultFormats
object Sample_Streaming {

  def main(args: Array[String]) {

    val sparkConf = new SparkConf().setAppName("Sample_Streaming")
      .setMaster("local[4]")

    val sc = new SparkContext(sparkConf)
    sc.setLogLevel("ERROR")

    val sqlContext = new SQLContext(sc)
    val ssc = new StreamingContext(sc, Seconds(1))

    val props = new HashMap[String, Object]()
    val bootstrap_server_config = "127.0.0.100:9092"
    val zkQuorum = "127.0.0.101:2181"

    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrap_server_config)
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer")

    val TopicMap = Map("sampleTopic" -> 1)
    val KafkaDstream = KafkaUtils.createStream(ssc, zkQuorum, "group", TopicMap).map(_._2)

    val schemaDf = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource")
      .option("spark.mongodb.input.uri", "connectionURI")
      .option("spark.mongodb.input.collection", "schemaCollectionName")
      .load()

    val outSchema = schemaDf.schema
    var outDf = sqlContext.createDataFrame(sc.emptyRDD[Row], outSchema)

    KafkaDstream.foreachRDD(rdd => rdd.collect().map { x =>
      {
        val jsonInput: JValue = parse(x)

        /*Do all the transformations using Json libraries*/
        val json4s_transformed = "transformed json"
        val rdd = sc.parallelize(compact(render(json4s_transformed)) :: Nil)
        val df = sqlContext.read.schema(outSchema).json(rdd)

        //Earlier we were inserting each message into mongoDB, which we would like to avoid and process all at once
        /* df.write.option("spark.mongodb.output.uri", "connectionURI")
             .option("collection", "Collection")
             .mode("append").format("com.mongodb.spark.sql").save() */

        outDf = outDf.union(df)
      }
    })

    //Added this part of the code in expectation to access the unioned dataframe and insert all messages at once
    //println(outDf.count())
    if (outDf.count() > 0) {
      outDf.write
        .option("spark.mongodb.output.uri", "connectionURI")
        .option("collection", "Collection")
        .mode("append").format("com.mongodb.spark.sql").save()
    }

    // Run the streaming job
    ssc.start()
    ssc.awaitTermination()
  }
}
It sounds like you want to reduce the number of connections to mongoDB. For that, use foreachPartition in the code where you open the connection to mongoDB (see the spec); the code will look like this:
rdd.repartition(1).foreachPartition { partition =>
  // get an instance of the connection
  // write/read the whole partition to mongo in one batch
  // close the connection
}
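For reference, a hedged pyspark sketch of the same pattern with pymongo (the URI, database, collection, and DStream names below are placeholders): one client and one bulk insert per partition instead of one connection per message.
import json
from pymongo import MongoClient

def write_partition(records):
    # `records` is an iterator over the messages of a single partition
    docs = [json.loads(record) for record in records]   # plus whatever validations are needed
    if docs:
        client = MongoClient("mongodb://127.0.0.1:27017")      # placeholder connection URI
        try:
            client["mydb"]["mycollection"].insert_many(docs)   # one bulk write per partition
        finally:
            client.close()

kafka_dstream.foreachRDD(lambda rdd: rdd.foreachPartition(write_partition))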