Parse Kafka JSON stream using pyspark and save to mongodb

I have a stream of alerts coming from Kafka to Spark. These are alerts in JSON format from different IoT Sensors.
Sample Kafka messages:
{ "id":"2093021", "alert":"Malfunction detected", "sensor_id":"14-23092-AS" }
{ "id":"2093021", "alert":"Malfunction detected", "sensor_id":"14-23092-AS", "alarm_code": "Severe" }
My code (spark-client.py):
from __future__ import print_function
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession
from pyspark.sql.context import SQLContext
import json

if __name__ == "__main__":
    spark = SparkSession.builder.appName("myApp") \
        .config("spark.mongodb.input.uri", "mongodb://spark:1234@172.31.9.44/at_cloudcentral.spark_test") \
        .config("spark.mongodb.output.uri", "mongodb://spark:1234@172.31.9.44/at_cloudcentral.spark_test") \
        .getOrCreate()
    sc = spark.sparkContext
    ssc = StreamingContext(sc, 10)

    zkQuorum, topic = sys.argv[1:]
    kafka_streams = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-sql-mongodb-test-consumer", {topic: 1})

    dstream = kafka_streams.map(lambda x: json.loads(x[1]))
    dstream.pprint()

    ssc.start()
    ssc.awaitTermination()
When I run this:
ubuntu@ip-172-31-89-176:~/spark-connectors$ spark-submit spark-client.py localhost:2181 DetectionEntry
I get this output:
-------------------------------------------
Time: 2019-12-04 14:26:40
-------------------------------------------
{u'sensor_id': u'16-23092-AS', u'id': u'2093021', u'alert': u'Malfunction detected'}
I need to be able to save this alert to a remote MongoDB. I have two specific challenges:
How do I correctly parse the output so that I can create a dataframe that can be written to MongoDB? I have tried adding this to the end of the code:
d = [dstream]
df = spark.createDataFrame(d).collect()
and it gives me this error:
dataType <py4j.java_gateway.JavaMember object at 0x7f5912726750> should be an instance of <class 'pyspark.sql.types.DataType'>
My alerts can have different JSON structures and I need to dump them into a MongoDB collection, so a fixed schema won't work for me. Most of the similar questions and code I have referred to on Stack Overflow are specific to a fixed schema, and I'm unable to figure out how to push this to MongoDB so that each record in the collection keeps its own schema (JSON structure). Any pointers in the right direction are appreciated.

You can parse the Kafka JSON messages easily with the PySpark-based Structured Streaming API by invoking a simple UDF. You can check the complete code in the Stack Overflow link below for reference.
Pyspark Structured streaming processing
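For reference, here is a minimal sketch of that approach, assuming Spark 2.4+ (for foreachBatch), the spark-sql-kafka-0-10 package, and pymongo on the driver; the broker address and checkpoint path are placeholders, while the topic, credentials and database/collection names are taken from the question. Each alert is inserted as its own document, so no fixed schema is needed:
import json

from pymongo import MongoClient
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-mongo").getOrCreate()

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker; the question only shows the ZooKeeper address
       .option("subscribe", "DetectionEntry")
       .load())

def write_batch(batch_df, batch_id):
    # Insert each alert as its own document, so records with different fields
    # (e.g. an optional alarm_code) land in the same collection unchanged.
    docs = [json.loads(row.value)
            for row in batch_df.selectExpr("CAST(value AS STRING) AS value").collect()]
    if docs:
        client = MongoClient("mongodb://spark:1234@172.31.9.44")  # credentials/host from the question
        client["at_cloudcentral"]["spark_test"].insert_many(docs)
        client.close()

query = (raw.writeStream
         .foreachBatch(write_batch)
         .option("checkpointLocation", "/tmp/kafka-to-mongo-checkpoint")  # placeholder path
         .start())
query.awaitTermination()
Collecting each micro-batch to the driver keeps the sketch simple and is fine for low-volume alerts; for heavier traffic, the MongoDB Spark connector or a per-partition writer would scale better.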

Related

How to change the JSON schema of Spark streaming events without interrupting the streaming job?

I have a use case in which I need to change the JSON schema without interrupting the streaming job. I am using a conf file where all the required schemas are defined. I have already tried cache and broadcast variables, persisting and unpersisting them from a separate streaming pipeline, but still no luck. Thanks in advance for your help!
Rather than reading the dataset as JSON, you can try reading it as text and then mapping it according to a schema that comes externally from a config file in HDFS or a database.
So instead of doing something like:
val df = spark.readStream.format("json").load(.. path ..)
do:
import sparkSession.implicits._

val df = spark.readStream
  .format("text").load( .. path .. )
  .select("value")
  .as[String]
  .mapPartitions(partStrings => {
    val currentSchema = readSchemaFromFile(???)
    partStrings.map(str => parseJSON(currentSchema, str))
  })
mapPartitions prevents schema lookup on each record.
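If the rest of the pipeline is PySpark (as in the original question), a rough equivalent of the same idea is sketched below. mapPartitions is a Scala Dataset method, so on the Python side the per-batch schema lookup is done inside foreachBatch instead; the paths and the schema-file helper are hypothetical:
import json

from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType

spark = SparkSession.builder.appName("schema-from-config").getOrCreate()

def read_schema_from_file(path="/path/to/schema.json"):
    # Hypothetical helper: the current schema is kept as a StructType JSON dump
    # in an external config location and re-read on every micro-batch.
    with open(path) as f:
        return StructType.fromJson(json.load(f))

lines = spark.readStream.format("text").load("/path/to/input")

def parse_batch(batch_df, batch_id):
    schema = read_schema_from_file()  # one lookup per micro-batch, not per record
    parsed = (batch_df
              .select(from_json(col("value"), schema).alias("data"))
              .select("data.*"))
    parsed.write.mode("append").parquet("/path/to/output")

(lines.writeStream
 .foreachBatch(parse_batch)
 .option("checkpointLocation", "/path/to/checkpoint")
 .start()
 .awaitTermination())
The schema is still read once per micro-batch rather than once per record, which keeps the lookup cost low while still allowing the schema to change between batches.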

Total records processed in each micro batch spark streaming

Is there a way I can find how many records got processed into the downstream Delta table for each micro-batch? I have a streaming job which runs once hourly using trigger.once() with append mode. For audit purposes, I want to know how many records got processed in each micro-batch. I've tried the below code to print the count of records processed (shown in the second line).
ss_count = 0

def write_to_managed_table(micro_batch_df, batchId):
    #print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
    ss_count = micro_batch_df.count()

saveloc = "TABLE_PATH"

df_final.writeStream.trigger(once=True).foreachBatch(write_to_managed_table).option('checkpointLocation', f"{saveloc}/_checkpoint").start(saveloc)

print(ss_count)
The streaming job runs without any issues, but micro_batch_df.count() will not print any count.
Any pointers would be much appreciated.
Here is a working example of what you are looking for (structured_steaming_example.py):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

# Create DataFrame representing the stream of input
df = spark.read.parquet("data/")
lines = spark.readStream.schema(df.schema).parquet("data/")

def batch_write(output_df, batch_id):
    print("inside foreachBatch for batch_id:{0}, rows in passed dataframe: {1}".format(batch_id, output_df.count()))

save_loc = "/tmp/example"
query = (lines.writeStream.trigger(once=True)
         .foreachBatch(batch_write)
         .option('checkpointLocation', save_loc + "/_checkpoint")
         .start(save_loc)
         )
query.awaitTermination()
Put a sample parquet file under the data folder and execute the code using spark-submit:
spark-submit --master local structured_steaming_example.py
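As a side note (not part of the example above, and assuming the same trigger-once query object), Structured Streaming also reports the processed row count through the query's progress metrics, which can be read after the batch completes:
# Hypothetical follow-up, run after query.awaitTermination(): lastProgress holds the
# metrics of the most recent micro-batch, including the number of input rows.
if query.lastProgress is not None:
    print("rows in last micro-batch:", query.lastProgress["numInputRows"])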

spark HWC cannot write to an existing table

In HDP 3.1.0, with HWC (hive-warehouse-connector-assembly-1.0.0.3.1.0.0-78.jar), I cannot append (or overwrite) to an existing table, depending on the database.
I tested on one database called DSN and it works; on another database called CLEAN_CRYPT it fails.
Both databases are encrypted and use Kerberos.
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession._
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession
val hive = com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder.session(spark).build()
hive.execute("show databases").show()
hive.setDatabase("clean_crypt")
val df=hive.execute("select * from test")
df.write.format(HIVE_WAREHOUSE_CONNECTOR).option("table","test").mode("append").save
The error message is "table already exists". I tried overwrite mode without success.
If I drop the table first, it passes!
Any idea?
This is probably related to a HWC bug which has been reported by multiple users here.
What I've found is that it only occurs if you try to use partitionBy when writing, like:
df.write.partitionBy("part")
  .mode(SaveMode.Overwrite)
  .format(com.hortonworks.hwc.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR)
  .option("table", "`default`.`testout`").save
On another note, if you remove the partitionBy piece, partitioning works as expected (as the partition info is already stored in the Hive table). But if you use overwrite mode (and not, for example, append), HWC will drop and recreate your table and it won't reapply the partitioning info.
If you want to use the Hortonworks connector and append to a partitioned table, you should not use partitionBy, as it does not seem to work properly with this connector. Instead, you can use the partition option and add the Spark parameters for dynamic partitioning.
Example:
import org.apache.spark.SparkConf
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseBuilder
import com.hortonworks.spark.sql.hive.llap.HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR
import org.apache.spark.sql.{SaveMode, SparkSession}
val sparkConf = new SparkConf()
  .setMaster("yarn")
  .setAppName("My application")
  .set("hive.exec.dynamic.partition", "true")
  .set("hive.exec.dynamic.partition.mode", "nonstrict")

val spark = SparkSession.builder()
  .config(sparkConf)
  .getOrCreate()

val hive = HiveWarehouseBuilder.session(spark).build()
val hiveDatabase = "clean_crypt"
hive.setDatabase(hiveDatabase)

val df = hive.execute("select * from test")
df
  .write
  .format(HIVE_WAREHOUSE_CONNECTOR)
  .mode(SaveMode.Append)
  .option("partition", partitionColumn)
  .option("table", table)
  .save()
For the above, the hive-warehouse-connector-assembly-1.0.0.3.1.4.0-315.jar was used. If the table does not exist, the connector creates it and stores it (by default) in ORC format.

Spark JDBC write to Salesforce error

I am trying to read data from Hive and write it to a custom object in Salesforce using a JDBC driver for Salesforce from Progress. Here is how I am trying to do this:
spark-shell --jars /usr/hdp/current/spark-client/lib/sforce.jar
import org.apache.spark.sql.hive._
val hc = new HiveContext(sc)
val results = hc.sql("select rep_name FROM schema.rpt_view")
print(results.first())
import org.apache.spark.sql.SaveMode
val url="jdbc:datadirect:sforce://login.salesforce.com"
val prop = new java.util.Properties
prop.put("user","user1")
prop.put("password","passwd")
prop.put("driver","com.ddtek.jdbc.sforce.SForceDriver")
results.write.mode(SaveMode.Append).jdbc(url, "SFORCE.test_tab1", prop)
I am getting the error:
java.sql.SQLSyntaxErrorException: [DataDirect][SForce JDBC Driver][SForce]column size is required in statement [CREATE TABLE SFORCE.test_tab1 (rep_name TEXT
Can someone help me here? If the table test_tab1 already exists, how do I configure the write? And if the table doesn't exist in Salesforce, how do I provide a column size value?
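No answer is included here, but one possible direction is sketched below as an untested PySpark rendition, assuming Spark 2.2+ where the createTableColumnTypes JDBC option exists: it supplies a sized column definition for the CREATE TABLE that Spark issues when the target table does not exist, while append mode against an existing table should only issue inserts. The VARCHAR size is an arbitrary assumption; URL, driver class, credentials and table name are copied from the question:
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-to-salesforce")
         .enableHiveSupport()
         .getOrCreate())

# Read the Hive view referenced in the question.
results = spark.sql("select rep_name FROM schema.rpt_view")

(results.write
    .format("jdbc")
    .option("url", "jdbc:datadirect:sforce://login.salesforce.com")
    .option("driver", "com.ddtek.jdbc.sforce.SForceDriver")
    .option("dbtable", "SFORCE.test_tab1")
    .option("user", "user1")
    .option("password", "passwd")
    .option("createTableColumnTypes", "rep_name VARCHAR(255)")  # assumed size; only used when Spark creates the table
    .mode("append")
    .save())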

How to write a Dataset to Kafka topic?

I am using Spark 2.1.0 and Kafka 0.9.0.
I am trying to push the output of a batch Spark job to Kafka. The job is supposed to run every hour, but not as a streaming job.
While looking for an answer on the net I could only find Kafka integration with Spark Streaming, and nothing about integration with a batch job.
Does anyone know if such thing is feasible ?
Thanks
UPDATE:
As mentioned by user8371915, I tried to follow what was done in Writing the output of Batch Queries to Kafka.
I used a spark shell:
spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0
Here is the simple code that I tried :
val df = Seq(("Rey", "23"), ("John", "44")).toDF("key", "value")
val newdf = df.select(to_json(struct(df.columns.map(column):_*)).alias("value"))
newdf.write.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("topic", "alerts").save()
But I get the error :
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:497)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
... 50 elided
Any idea what this is related to?
Thanks
tl;dr You are using an outdated Spark version. Writes are enabled in Spark 2.2 and later.
Out of the box you can use the Kafka SQL connector (the same one used with Structured Streaming). Include spark-sql-kafka in your dependencies.
Convert the data to a DataFrame containing at least a value column of type StringType or BinaryType.
Write the data to Kafka:
df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", server)
  .save()
Follow Structured Streaming docs for details (starting with Writing the output of Batch Queries to Kafka).
If you have a dataframe and you want to write it to a Kafka topic, you need to convert the columns first to a "value" column that contains the data in JSON format. In Scala it is:
import org.apache.spark.sql.functions._
val kafkaServer: String = "localhost:9092"
val topicSampleName: String = "kafkatopic"
df.select(to_json(struct("*")).as("value"))
  .selectExpr("CAST(value AS STRING)")
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaServer)
  .option("topic", topicSampleName)
  .save()
For this error
java.lang.RuntimeException: org.apache.spark.sql.kafka010.KafkaSourceProvider does not allow create table as select.
at scala.sys.package$.error(package.scala:27)
I think you need to convert the message to a key-value pair. Your dataframe should have a value column.
Let's say you have a dataframe with student_id and score columns:
df.show()
>> student_id | scores
   1          | 99.00
   2          | 98.00
Then you should modify your dataframe to a single value column:
value
{"student_id":1,"score":99.00}
{"student_id":2,"score":98.00}
To convert, you can use code similar to this:
df.select(to_json(struct($"student_id",$"score")).alias("value"))
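For a PySpark job like the one in the original question, the same batch write looks roughly like this (a sketch assuming Spark 2.2+ with the spark-sql-kafka-0-10 package on the classpath; the broker and topic are the placeholder values used above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_json, struct

spark = SparkSession.builder.appName("batch-to-kafka").getOrCreate()

# Toy dataframe mirroring the student example above.
df = spark.createDataFrame([(1, 99.00), (2, 98.00)], ["student_id", "score"])

# Pack all columns into a JSON "value" column and write it to the topic.
(df.select(to_json(struct(*df.columns)).alias("value"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "localhost:9092")
   .option("topic", "kafkatopic")
   .save())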