Getting 0 records when reading batch data from Kafka with Spark (Scala)

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col
val spark: SparkSession = SparkSession.builder
  .master("local[3]")
  .appName("Kafka testing")
  .config("spark.streaming.stopGracefullyOnShutdown", "true")
  .getOrCreate()
// Batch read (spark.read, not readStream) of a fixed offset range per partition
val df = spark.read
  .format("kafka")
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.truststore.location", "certs/truststore.jks")
  .option("kafka.ssl.keystore.location", "certs/keystore.jks")
  .option("kafka.ssl.key.password", "somePassword")
  .option("kafka.ssl.keystore.password", "somePassword")
  .option("kafka.ssl.truststore.password", "somePassword")
  .option(
    "kafka.bootstrap.servers",
    "server1:16501,server2:16501,server3:16501"
  )
  .option("kafka.group.id", "dev-nimbus-udw")
  .option("subscribe", "myTopic")
  .option("startingOffsets", """{"myTopic":{"0":1193954,"1":1211438,"3":1203538,"2":1077955}}""")
  .option("endingOffsets", """{"myTopic":{"0":1193994,"1":1211478,"3":1203578,"2":1077995}}""")
  .load()
// Just printing
println(s"The number of records read is ${df.count()}")
println(s"The number of partitions of this dataframe is ${df.rdd.getNumPartitions}")
// Write only the message payload as JSON
df.select(col("value"))
  .write
  .format("json")
  .mode(SaveMode.Overwrite)
  .save("readSpecificCountsBatch")
I tried reading a specific range of offsets from Kafka using Spark Structured Streaming's Kafka source in batch mode.
I specified the offset range and confirmed that those offsets are present in the topic.
I am getting zero records in the output folder.
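One quick sanity check, independent of Spark, is to work out how many records the two offset maps should cover. The sketch below is plain Python, and the helper name `expected_record_count` is made up for illustration. It treats the ending offset as exclusive, which is how the Kafka source reads a batch range, and the result is only an upper bound: on compacted or transactional topics some offsets in the gap may not correspond to a readable record.

```python
import json

# Offset maps copied from the startingOffsets / endingOffsets options above.
starting = '{"myTopic":{"0":1193954,"1":1211438,"3":1203538,"2":1077955}}'
ending = '{"myTopic":{"0":1193994,"1":1211478,"3":1203578,"2":1077995}}'


def expected_record_count(starting_json: str, ending_json: str) -> dict:
    """Return {(topic, partition): end - start} for every partition in the range."""
    start = json.loads(starting_json)
    end = json.loads(ending_json)
    counts = {}
    for topic, partitions in start.items():
        for partition, first_offset in partitions.items():
            # Raises KeyError if the ending map lacks this topic or partition.
            end_offset = end[topic][partition]
            if end_offset < first_offset:
                raise ValueError(
                    f"{topic}-{partition}: ending offset precedes starting offset"
                )
            counts[(topic, int(partition))] = end_offset - first_offset
    return counts


per_partition = expected_record_count(starting, ending)
total = sum(per_partition.values())
print(per_partition)
print(f"expected at most {total} records")
```

For the offsets in the question this gives 40 per partition and 160 in total, so a count of 0 is not explained by an empty range.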

Can I "branch" stream into many and write them in parallel in pyspark?

I am receiving Kafka stream in pyspark. Currently I am grouping it by one set of fields and writing updates to database:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic)
...
df = df \
    .groupBy("myfield1") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )
query = df \
    .writeStream \
    .outputMode("update") \
    .foreachBatch(lambda df, epoch: write_data_frame(table_name, df, epoch)) \
    .start()
query.awaitTermination()
Can I branch off the same chain in the middle and create another grouping, like
df2 = df \
    .groupBy("myfield2") \
    .agg(
        expr("count(*) as cnt"),
        min(struct(col("mycol.myfield").alias("mmm"), col("*"))).alias("minData")
    ) \
    .select("cnt", "minData.*") \
    .select(
        col("...").alias("..."),
        ...
        col("userId").alias("user_id")
    )
and write its output to a different place in parallel?
Where should I call writeStream and awaitTermination?
Yes, you can branch a Kafka input stream into as many streaming queries as you like.
You need to consider the following:
query.awaitTermination() is a blocking call: any code placed after it will not run until that query terminates.
Each "branched" streaming query runs in parallel, and it is important to set a separate checkpoint location in each of their writeStream calls.
Overall, your code needs to have the following structure:
df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", config["kafka"]["bootstrap.servers"]) \
    .option("subscribe", topic) \
    .[...]
# note that I changed the variable name to "df1"
df1 = df \
    .groupBy("myfield1") \
    .[...]
df2 = df \
    .groupBy("myfield2") \
    .[...]
# foreachBatch hands each micro-batch over as a plain DataFrame; write that one, not the streaming df1/df2
query1 = df1 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc1") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()
query2 = df2 \
    .writeStream \
    .outputMode("update") \
    .option("checkpointLocation", "/tmp/checkpointLoc2") \
    .foreachBatch(lambda batch_df, epoch: write_data_frame(table_name, batch_df, epoch)) \
    .start()
# blocks until any one of the running queries terminates
spark.streams.awaitAnyTermination()
Just an additional remark: in the code you are showing, you overwrite df, so deriving df2 from it might not give you the results you intended.

How to join two streaming datasets when one dataset involves aggregation

I am getting the following error from the code below:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema:
val schema = new StructType()
  .add("product", StringType)
  .add("org", StringType)
  .add("quantity", IntegerType)
  .add("booked_at", TimestampType)
Creating the streaming source dataset:
val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
Creating another streaming DataFrame in which the aggregation is done, then joining it with the original source DataFrame to filter out records:
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
  "select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I tried to display the results on the console with Append output mode. I cannot figure out where I need to add a watermark or how else to resolve this. My objective is to keep, in every trigger, only those events whose timestamp is higher than the maximum timestamp received in any of the earlier triggers.
total_qty
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
With Spark Structured Streaming, an aggregation on a stream can be written in append mode only if the stream has a watermark. If you have a column with the timestamp of the event, you can do it like this:
val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
  // "event_time" stands for your event-timestamp column (in the schema above that would be booked_at)
  .withWatermark("event_time", "1 minutes")
Queries with aggregation support three output modes:
Append mode uses the watermark to drop old aggregation state. However, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(): by this mode's semantics, rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section of the Structured Streaming Programming Guide for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
Edit later:
You have to add a window to your groupBy:
val aggFg = payload_df
  .groupBy(window($"event_time", "1 minute"), $"org", $"product")
  .agg(max($"booked_at").as("booked_at"))
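The delay that append mode introduces can be seen in a small model. The following is a plain-Python toy, not Spark code: the function `simulate_append_mode` and the sample events are invented for illustration, times are in seconds, and the watermark is advanced after every event, whereas Spark advances it between micro-batches. It keeps a running max per one-minute window and emits a window only once the watermark (largest event time seen minus one minute) has reached the window's end.

```python
WINDOW_SECONDS = 60    # mirrors window($"event_time", "1 minute")
WATERMARK_DELAY = 60   # mirrors withWatermark("event_time", "1 minutes")


def simulate_append_mode(events):
    """Toy model of a windowed max() written in append mode.

    events: iterable of (event_time, key, booked_at) in arrival order.
    Returns the rows (window_start, key, max_booked_at) in the order they
    would be appended, i.e. only after their window has been finalized.
    """
    state = {}             # (window_start, key) -> running max of booked_at
    max_event_time = None  # largest event time seen so far
    emitted = []
    for event_time, key, booked_at in events:
        watermark = (
            float("-inf") if max_event_time is None
            else max_event_time - WATERMARK_DELAY
        )
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        if window_start + WINDOW_SECONDS <= watermark:
            continue  # late event: its window was already finalized
        slot = (window_start, key)
        state[slot] = max(state.get(slot, booked_at), booked_at)
        # Advance the watermark and emit every window it has now passed.
        max_event_time = (
            event_time if max_event_time is None
            else max(max_event_time, event_time)
        )
        watermark = max_event_time - WATERMARK_DELAY
        finalized = sorted(s for s in state if s[0] + WINDOW_SECONDS <= watermark)
        for done in finalized:
            emitted.append((done[0], done[1], state.pop(done)))
    return emitted


events = [
    (10, "acme", 5),    # window [0, 60)
    (50, "acme", 9),    # window [0, 60)
    (70, "acme", 3),    # window [60, 120)
    (125, "acme", 7),   # watermark reaches 65, window [0, 60) is finalized
    (30, "acme", 99),   # arrives too late and is dropped
    (185, "acme", 1),   # watermark reaches 125, window [60, 120) is finalized
]
result = simulate_append_mode(events)
print(result)
```

Here the window starting at 0 is appended only when the event at 125 s arrives, and the straggler with event time 30 is dropped because its window was already finalized. Update mode would instead report each window's running max as soon as it changes.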

PySpark: Spark Streaming via socket

I am trying to read a data stream from a socket (ps.pndsn.com) and write it into temp_table for further processing. The issue I am facing is that temp_table, which I create as part of writeStream, is empty even though data is streaming in real time. I am looking for help with this.
Below is the code snippet:
# Create DataFrame representing the stream of input streamingDF from connection to ps.pndsn.com:9999
streamingDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "ps.pndsn.com") \
    .option("port", 9999) \
    .load()
# Is this DF actually a streaming DF?
streamingDF.isStreaming
spark.conf.set("spark.sql.shuffle.partitions", "2") # keep the size of shuffles small
query = (
    streamingDF
    .writeStream
    .format("memory")
    .queryName("temp_table") # temp_table = name of the in-memory table
    .outputMode("Append") # Append = OutputMode in which only the new rows in the streaming DataFrame/Dataset will be written to the sink
    .start()
)
Streaming output :
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of the temp_table is empty.

Why could streaming join of queries over kafka topics take so long?

I'm using Spark Structured Streaming and joining two streams from Kafka topics.
I noticed that the streaming query takes around 15 seconds for each record. In the screenshot below, the stage with ID 2 takes 15 s. Why could that be?
The code is as follows:
val kafkaTopic1 = "demo2"
val kafkaTopic2 = "demo3"
val bootstrapServer = "localhost:9092"
val spark = SparkSession
  .builder
  .master("local")
  .getOrCreate
import spark.implicits._
val df1 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", kafkaTopic1)
  .option("failOnDataLoss", false)
  .load
val df2 = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", bootstrapServer)
  .option("subscribe", kafkaTopic2)
  .option("failOnDataLoss", false)
  .load
val order_details = df1
  .withColumn(...)
  .select(...)
val invoice_details = df2
  .withColumn(...)
  .where(...)
order_details
  .join(invoice_details)
  .where(order_details.col("s_order_id") === invoice_details.col("order_id"))
  .select(...)
  .writeStream
  .format("console")
  .option("truncate", false)
  .start
  .awaitTermination()
Code-wise everything works fine. The only problem is the time to join the two streams. How could this query be optimised?
The slow execution may well come down to the master URL: .master("local") runs Spark with a single worker thread. Change it to local[*] (one worker thread per logical core) at the very least and you should find the join faster.

Union operation in spark running very slow

I'm running Spark SQL with the statements and configuration below, but the dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1) step takes a long time to execute, roughly 5 minutes, even though my input Parquet file has just 88 records. Any thoughts on what the issue could be?
val spark = SparkSession
  .builder()
  .appName("SparkSessionZipsExample")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .config("spark.master", "local")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
// set new runtime options
spark.conf.set("spark.sql.shuffle.partitions", 6)
spark.conf.set("spark.executor.memory", "2g")
spark.conf.set("spark.driver.host", "localhost")
spark.conf.set("spark.cores.max", "8")
// one query per field in m; triple quotes let the SQL string span two lines
val dfs = m.map(field => spark.sql(
  s"""select 'DataProfilerStats' as Table_Name,
     |'$field' as Column_Name, min($field) as min_value from parquetDFTable""".stripMargin))
val withSum = dfs.reduce((x, y) => x.union(y)).distinct().coalesce(1)
UPDATE
I have a single Parquet file that I read into a DataFrame; part of my question is whether it can be split into smaller chunks.