writeStream query not writing data into a table in Databricks - PySpark

Can someone help me with this issue?
I have a Delta table "orders", which was loaded with 1000 records from a Delta file.
We now receive streaming JSON files whose data should be appended to this table.
The readStream DataFrame **orderInputDF**:
from pyspark.sql.functions import col, current_timestamp, lit, to_timestamp

orderInputDF = (spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)
    .json(stream_path)
    .withColumn("submitted_at", to_timestamp("submittedAt"))
    .withColumn("submitted_yyyy_mm", col("submitted_at").substr(0, 7))
    .withColumnRenamed("orderId", "order_id")
    .withColumnRenamed("customerId", "customer_id")
    .withColumnRenamed("salesRepId", "sales_rep_id")
    .withColumn("shipping_address_attention", col("shippingAddress.attention"))
    .withColumn("shipping_address_address", col("shippingAddress.address"))
    .withColumn("shipping_address_city", col("shippingAddress.city"))
    .withColumn("shipping_address_state", col("shippingAddress.state"))
    .withColumn("shipping_address_zip", col("shippingAddress.zip"))
    .withColumn("ingest_file_name", lit(files[0][0]))
    .withColumn("ingested_at", current_timestamp())
    .drop("products", "shippingAddress", "submittedAt")
)
This readStream DataFrame returns 20 records.
The writeStream query is as follows:
orderOutputDF = (orderInputDF
    .writeStream
    .format("delta")
    .format("memory")
    .queryName(orders_table)
    .outputMode("append")
    .option("checkpointLocation", orders_checkpoint_path)
    .start()
)
This query is not appending the data. Every time it runs, it inserts the new records and deletes the old ones.
Can someone help me with this?
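One thing that stands out in the query above is that .format(...) is called twice. Each call sets the sink format, so the later "memory" replaces "delta" and the rows go to an in-memory table named by queryName rather than to the Delta table. Below is a minimal sketch, assuming the goal is to append to the existing Delta table "orders" and that Spark 3.1 or later is available (for DataStreamWriter.toTable). The helper name append_stream_to_delta and its parameters are hypothetical, not part of the original code.

```python
def append_stream_to_delta(stream_df, table_name, checkpoint_path):
    """Start a streaming append into a Delta table (hypothetical helper).

    stream_df is assumed to be a streaming DataFrame such as orderInputDF.
    """
    return (
        stream_df.writeStream
        .format("delta")  # one format call only; a second .format(...) would replace it
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        .toTable(table_name)  # DataStreamWriter.toTable, Spark 3.1+
    )
```

It could be called as append_stream_to_delta(orderInputDF, "orders", orders_checkpoint_path), which returns the running StreamingQuery.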

Related

Counting unique values on grouped data in a Spark Dataframe with Structured Streaming on Delta Lake

Hi everyone.
I have a Structured Streaming pipeline on Delta Lake.
My final table is supposed to count how many unique IDs access a platform per week.
I'm grouping the data by week in the stream, but I cannot count the unique IDs in the other column: I keep getting the count of all rows, repeats included.
I have tried grouping the data twice, by week and then by device_id.
I have tried dropDuplicates().
Nothing has worked so far.
Can someone explain what I am missing?
My code:
from pyspark.sql.functions import weekofyear, col

def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear("client_event_time"))
        .groupBy(col("week"))
        .count()
        .select(col("week"), col("count").alias("WAU"))
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))
The following code worked.
I used approx_count_distinct with rsd=0.01.
from pyspark.sql.functions import approx_count_distinct, col, weekofyear

def silverToGold(silverPath, goldPath, queryName):
    (spark.readStream
        .format("delta")
        .load(silverPath)
        .withColumn("week", weekofyear(col("eventDate")))
        .groupBy(col("week"))
        # Estimate the number of distinct device IDs per week (max relative error 1%)
        .agg(approx_count_distinct("device_id", rsd=0.01).alias("WAU"))
        .select("week", "WAU")
        .writeStream
        .format("delta")
        .option("checkpointLocation", goldPath + "/_checkpoint")
        .queryName(queryName)
        .outputMode("complete")
        .start(goldPath))
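To see why the first version over-counts, it may help to compare the two aggregations on a tiny in-memory sample. This is a plain-Python sketch, not Spark code: the sample events are made up, and it uses exact sets where approx_count_distinct would use an estimate. It groups by ISO week number, which is also how Spark's weekofyear numbers weeks.

```python
from collections import defaultdict
from datetime import date

# Made-up sample: (event date, device_id). Device "a" appears twice in week 1.
events = [
    (date(2021, 1, 4), "a"),
    (date(2021, 1, 5), "a"),
    (date(2021, 1, 6), "b"),
    (date(2021, 1, 12), "a"),
]


def weekly_counts(rows):
    """Return {iso_week: (row_count, distinct_device_count)}."""
    row_count = defaultdict(int)
    devices = defaultdict(set)
    for day, device_id in rows:
        week = day.isocalendar()[1]
        row_count[week] += 1          # what groupBy("week").count() measures
        devices[week].add(device_id)  # what a distinct count measures
    return {week: (row_count[week], len(devices[week])) for week in row_count}


print(weekly_counts(events))  # {1: (3, 2), 2: (1, 1)}
```

Week 1 has three rows but only two devices, which is the gap between count() and a distinct count of device_id.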

Multiple aggregations and Distinct Function in Spark Structured Streaming

I need to run some aggregations on streaming data from Kafka and print the top 10 rows of the result to the console every M seconds.
input_df = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("subscribe", "page_views")
    .load()
    .selectExpr('cast(value as string)')
)
...
...
# info has 2 cols: domain, uid (info = transformation of input_df)
# It's an example of what I want to do (like in simple pyspark)
stat = (
    info
    .groupby('domain')
    .agg(
        F.count(F.col('UID')).alias('view'),
        F.countDistinct(F.col('UID')).alias('unique')
    )
    .sort(F.col("view").desc())
    .limit(10)
)
query = (
    stat
    .writeStream
    .outputMode("complete")
    .format("console")
    .option("truncate", "true")
    .start()
)
This example has no time trigger, but I can add that myself.
Because countDistinct is not allowed on a stream, I have no idea how to do this.
I tried building two DataFrames, one per aggregation (df_1 = (domain, view), df_2 = (domain, unique)), and then joining df_1 with df_2, but multiple aggregations are not allowed either. So it's a dead end for me.
It would be great to have a solution for this.
Thanks for your attention!
You can do this with flatMapGroupsWithState, which is an arbitrary stateful operation. It also supports the append and update output modes.
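Note that flatMapGroupsWithState belongs to the Scala/Java Dataset API; in PySpark the closest equivalent is applyInPandasWithState (Spark 3.4+). The idea behind either is to keep your own per-key state and update it on every micro-batch. Here is a rough plain-Python sketch of that idea, not Spark code: the function names and the sample batches are made up for illustration.

```python
def update_domain_state(state, batch):
    """Fold one micro-batch of (domain, uid) rows into per-domain state.

    state maps domain -> [view_count, set_of_uids] and is mutated in place.
    """
    for domain, uid in batch:
        entry = state.setdefault(domain, [0, set()])
        entry[0] += 1      # total views
        entry[1].add(uid)  # unique visitors
    return state


def top_domains(state, n=10):
    """Return up to n (domain, view, unique) rows, most viewed first."""
    rows = [(domain, views, len(uids)) for domain, (views, uids) in state.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:n]


state = {}
update_domain_state(state, [("a.com", 1), ("a.com", 1), ("b.com", 2)])
update_domain_state(state, [("a.com", 3)])
print(top_domains(state))  # [('a.com', 3, 2), ('b.com', 1, 1)]
```

In a real job the set of UIDs per domain lives in Spark's state store, so it grows with the number of distinct users unless you expire or approximate it.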

How to join two streaming datasets when one dataset involves aggregation

I am getting the error below from the following code snippet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema:
val schema = new StructType()
  .add("product", StringType)
  .add("org", StringType)
  .add("quantity", IntegerType)
  .add("booked_at", TimestampType)
Creating the streaming source dataset:
val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
I then create another streaming DataFrame that performs the aggregation, and join it with the original source DataFrame to filter the records:
payload_df.createOrReplaceTempView("orders")
val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")
val total_qty = spark.sql(
  "select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at ")
Finally, I try to display the results on the console in Append output mode. I cannot figure out where I need to add a watermark or how to resolve this. My objective is to keep, in every trigger, only those events whose timestamp is higher than the maximum timestamp received in any earlier trigger.
total_qty
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
With Spark Structured Streaming, an aggregation on a stream can be written in Append mode only if the stream has a watermark. If you have a column with the timestamp of the event, you can add one like this:
val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
  .withWatermark("event_time", "1 minutes")
For queries with aggregation there are three output modes:
Append mode uses the watermark to drop old aggregation state. However, the output of a windowed aggregation is delayed by the late threshold specified in withWatermark() because, by the mode's semantics, rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section of the Spark documentation for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
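To make the Append mode behaviour more concrete, here is a simplified plain-Python model of it, not Spark code. It assumes integer event times in seconds, tumbling windows, and a watermark that trails the largest event time seen by a fixed delay; a window's maximum is emitted only once the watermark reaches the window's end. Real Spark advances the watermark between micro-batches rather than per event, and the function and parameter names here are made up.

```python
def finalize_windows(events, window_s=60, delay_s=60):
    """Model append mode: emit (window_start, max_value) once a window is final.

    events is an iterable of (event_time_seconds, value) in arrival order.
    Returns (emitted_rows, still_open_windows).
    """
    open_windows = {}  # window start -> max value seen so far
    emitted = []
    max_event_time = 0
    for event_time, value in events:
        watermark = max_event_time - delay_s
        start = event_time - event_time % window_s
        if start + window_s <= watermark:
            continue  # too late: this window was already finalized and dropped
        open_windows[start] = max(open_windows.get(start, value), value)
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - delay_s
        for s in sorted(open_windows):
            if s + window_s <= watermark:
                emitted.append((s, open_windows.pop(s)))
    return emitted, open_windows


emitted, pending = finalize_windows([(10, 5), (50, 9), (70, 3), (130, 4), (20, 100)])
print(emitted)  # [(0, 9)]
print(pending)  # {60: 3, 120: 4}
```

The window starting at 0 is emitted only after the event at t=130 pushes the watermark to 70, and the late event at t=20 is then ignored. That delay and that dropping of late rows are the two effects described for Append mode above.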
Later edit:
You also have to add a window to your groupBy:
val aggFg = payload_df.groupBy(window($"event_time", "1 minute"), $"org", $"product")
  .agg(max($"booked_at").as("booked_at"))

pyspark : Spark Streaming via socket

I am trying to read a data stream from a socket (ps.pndsn.com) and write it into temp_table for further processing. The issue I am facing is that temp_table, which I create as part of writeStream, is empty even though data is streaming in real time. I am looking for help with this.
Below is the code snippet :
# Create DataFrame representing the stream of input streamingDF from connection to ps.pndsn.com:9999
streamingDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "ps.pndsn.com") \
    .option("port", 9999) \
    .load()
# Is this DF actually a streaming DF?
streamingDF.isStreaming
spark.conf.set("spark.sql.shuffle.partitions", "2")  # keep the size of shuffles small
query = (
    streamingDF
    .writeStream
    .format("memory")
    .queryName("temp_table")  # temp_table = name of the in-memory table
    .outputMode("Append")     # Append = OutputMode in which only the new rows in the streaming DataFrame/Dataset will be written to the sink
    .start()
)
Streaming output :
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of the temp_table is empty.

Spark Dataframe upsert to Elasticsearch

I am using an Apache Spark DataFrame and I want to upsert data to Elasticsearch.
I found that I can overwrite the data like this:
val df = spark.read.option("header", "true").csv("/mnt/data/akc_breed_info.csv")
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.port", "443")
  .option("es.net.ssl", "true")
  .option("es.nodes", esURL)
  .option("es.mapping.id", index)
  .mode("Overwrite")
  .save("index/dogs")
However, I noticed that mode("Overwrite") actually deletes all existing duplicated data and then inserts the new data.
Is there a way to upsert the data instead of deleting and rewriting it? I need to query this data in almost real time. Thanks in advance.
The reason mode("Overwrite") was a problem is that overwriting with the entire DataFrame deletes all data matching the DataFrame's rows at once, so the entire index looked empty to me. I figured out how to actually upsert instead.
Here is my code:
df.write
  .format("org.elasticsearch.spark.sql")
  .option("es.nodes.wan.only", "true")
  .option("es.nodes.discovery", "false")
  .option("es.nodes.client.only", "false")
  .option("es.net.ssl", "true")
  .option("es.mapping.id", index)
  .option("es.write.operation", "upsert")
  .option("es.nodes", esURL)
  .option("es.port", "443")
  .mode("append")
  .save(path)
Note that you have to set "es.write.operation" to "upsert" and use .mode("append").
Try setting:
es.write.operation = upsert
This should perform the required operation. You can find more details in https://www.elastic.co/guide/en/elasticsearch/hadoop/current/configuration.html
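For anyone doing the same from PySpark, the writer call looks much the same. This is a minimal sketch, assuming the elasticsearch-hadoop connector is on the cluster and that the connection settings from the Scala code above (port 443, SSL, WAN-only) apply. The helper name upsert_to_elasticsearch and its parameters are hypothetical.

```python
def upsert_to_elasticsearch(df, es_url, id_column, resource):
    """Upsert a DataFrame into Elasticsearch (hypothetical helper).

    id_column names the DataFrame column used as the document ID, so rows
    with an existing ID update that document instead of replacing the index.
    """
    (
        df.write
        .format("org.elasticsearch.spark.sql")
        .option("es.nodes", es_url)
        .option("es.port", "443")
        .option("es.net.ssl", "true")
        .option("es.nodes.wan.only", "true")
        .option("es.mapping.id", id_column)
        .option("es.write.operation", "upsert")
        .mode("append")  # append, not overwrite, so existing documents are kept
        .save(resource)
    )
```

The resource argument is the same target you would pass to save() in Scala, for example "index/dogs".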