pyspark: Spark Streaming via socket

I am trying to read the data stream from a socket (ps.pndsn.com) and write it into temp_table for further processing. The issue I am facing is that the temp_table I created as part of writeStream is empty, even though the stream is producing data in real time. Any help in this regard would be appreciated.
Below is the code snippet:
# Create DataFrame representing the stream of input from the connection to ps.pndsn.com:9999
streamingDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "ps.pndsn.com") \
    .option("port", 9999) \
    .load()

# Is this DF actually a streaming DF?
streamingDF.isStreaming

spark.conf.set("spark.sql.shuffle.partitions", "2")  # keep the size of shuffles small

query = (
    streamingDF
    .writeStream
    .format("memory")
    .queryName("temp_table")  # temp_table = name of the in-memory table
    .outputMode("append")     # append = only the new rows in the streaming DataFrame/Dataset are written to the sink
    .start()
)
Streaming output:
{'channel': 'pubnub-sensor-network',
'message': {'ambient_temperature': '1.361',
'humidity': '81.1392',
'photosensor': '758.82',
'radiation_level': '200',
'sensor_uuid': 'probe-84d85b75',
'timestamp': 1581332619},
'publisher': None,
'subscription': None,
'timetoken': 15813326199534409,
'user_metadata': None}
The output of the temp_table is empty.
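For reference, a minimal sketch of how the in-memory sink could be inspected once the query has had time to run (the sleep interval is an arbitrary assumption; the socket source exposes each received line as a single string column named value):
import time

# Let a few micro-batches run before looking at the sink
time.sleep(10)

# The memory sink registers an in-memory table under the queryName
spark.sql("SELECT * FROM temp_table").show(truncate=False)

# If the table is still empty, check whether the query is actually making progress
print(query.status)
print(query.lastProgress)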

Related

I am getting 0 records when reading batch data from Kafka using Spark

import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.sql.functions.col

val spark: SparkSession = SparkSession.builder
  .master("local[3]")
  .appName("Kafka testing")
  .config("spark.streaming.stopGracefullyOnShutdown", "true")
  .getOrCreate()

val df = spark.read
  .format("kafka")
  .option("kafka.security.protocol", "SSL")
  .option("kafka.ssl.truststore.location", "certs/truststore.jks")
  .option("kafka.ssl.keystore.location", "certs/keystore.jks")
  .option("kafka.ssl.key.password", "somePassword")
  .option("kafka.ssl.keystore.password", "somePassword")
  .option("kafka.ssl.truststore.password", "somePassword")
  .option(
    "kafka.bootstrap.servers",
    "server1:16501,server2:16501,server3:16501"
  )
  .option("kafka.group.id", "dev-nimbus-udw")
  .option("subscribe", "myTopic")
  .option("startingOffsets", """{"myTopic":{"0":1193954,"1":1211438,"3":1203538,"2":1077955}}""")
  .option("endingOffsets", """{"myTopic":{"0":1193994,"1":1211478,"3":1203578,"2":1077995}}""")
  .load()

// Just printing
println(s"The count is | The number of records read is ${df.count()}")
println(s"The number of partitions of this dataframe is ${df.rdd.getNumPartitions}")

df.select(col("value"))
  .write
  .format("json")
  .mode(SaveMode.Overwrite)
  .save("readSpecificCountsBatch")
I tried reading a specific set of offsets from Kafka using Spark structured streaming.
I specified the offset range and I confirmed that the offsets are present.
I am getting zero records in the output folder.
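As a sanity check (a sketch only, written in PySpark for consistency with the rest of the post; the SSL options from the question would still be needed, and the broker/topic names are taken from the snippet above), reading the topic end-to-end shows whether the batch Kafka source can see any records at all before narrowing back to specific offsets:
# Hypothetical sanity check: read the whole topic instead of a fixed offset range
df_all = (spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "server1:16501,server2:16501,server3:16501")
    .option("subscribe", "myTopic")
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .load())

# If this is also 0, the issue is connectivity/authorization rather than the offset ranges
print(df_all.count())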

Filter rows of snowflake table while reading in pyspark dataframe

I have a huge Snowflake table. I want to do some transformations on the table in pyspark. My Snowflake table has a column called 'snapshot'. I only want to read the current snapshot data into a pyspark dataframe and do the transformations on the filtered data.
So, is there a way to filter the rows while reading the Snowflake table into a Spark dataframe (I don't want to read the entire Snowflake table into memory because it is not efficient), or do I need to read the entire Snowflake table (into a Spark dataframe) and then apply a filter to get the latest snapshot, something like below?
from pyspark.sql.functions import current_timestamp

SNOWFLAKE_SOURCE_NAME = "net.snowflake.spark.snowflake"
snowflake_database = "********"
snowflake_schema = "********"
source_table_name = "********"

snowflake_options = {
    "sfUrl": "********",
    "sfUser": "********",
    "sfPassword": "********",
    "sfDatabase": snowflake_database,
    "sfSchema": snowflake_schema,
    "sfWarehouse": "COMPUTE_WH"
}

df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name) \
    .load()

df = df.where(df.snapshot == current_timestamp()).collect()
There are forms of filters (the filter or where functionality of a Spark DataFrame) that Spark does not pass down to the Spark Snowflake connector. That means that, in some situations, you may get more records than you expect.
The safest way would be to use a SQL query directly:
df = spark.read \
    .format(SNOWFLAKE_SOURCE_NAME) \
    .options(**snowflake_options) \
    .option("query", "SELECT X, Y, Z FROM TABLE1 WHERE SNAPSHOT = CURRENT_TIMESTAMP()") \
    .load()
Of course, if you want to use the filter/where functionality of the Spark DataFrame, check the Query History in the Snowflake UI to see whether the generated query has the right filter applied.
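For example, a sketch of the filter/where variant (names reused from the question; whether the predicate actually reaches Snowflake has to be verified in Query History, as noted above):
from pyspark.sql.functions import col, current_timestamp

df = (spark.read
    .format(SNOWFLAKE_SOURCE_NAME)
    .options(**snowflake_options)
    .option("dbtable", snowflake_database + "." + snowflake_schema + "." + source_table_name)
    .load()
    .where(col("snapshot") == current_timestamp()))

# Trigger a small action, then look at the generated SQL in the Snowflake Query History;
# it should contain a WHERE clause if the filter was pushed down to the connector.
df.limit(10).show()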

Writestreaming query not writing data into table in databricks

Can someone help me with this issue?
I have a Delta table "orders". This table is loaded with 1000 records using the Delta file.
Now we are getting a streaming JSON file which appends data into this table.
Readstream dataframe orderInputDF:
from pyspark.sql.functions import *

orderInputDF = (spark
    .readStream
    .schema(jsonSchema)
    .option("maxFilesPerTrigger", 1)
    .json(stream_path)
    .withColumn("submitted_at", to_timestamp('submittedAt'))
    .withColumn("submitted_yyyy_mm", col('submitted_at').substr(0, 7))
    .withColumnRenamed("orderId", 'order_id')
    .withColumnRenamed("customerId", "customer_id")
    .withColumnRenamed("salesRepId", "sales_rep_id")
    .withColumn("shipping_address_attention", col("shippingAddress.attention"))
    .withColumn("shipping_address_address", col("shippingAddress.address"))
    .withColumn("shipping_address_city", col("shippingAddress.city"))
    .withColumn("shipping_address_state", col("shippingAddress.state"))
    .withColumn("shipping_address_zip", col("shippingAddress.zip"))
    .withColumn("ingest_file_name", lit(files[0][0]))
    .withColumn("ingested_at", current_timestamp())
    .drop("products", "shippingAddress", "submittedAt")
)
This readStream dataframe is returning 20 records.
The writeStream query is as below:
orderOutputDF = (orderInputDF
    .writeStream
    .format("delta")
    .format("memory")
    .queryName(orders_table)
    .outputMode("append")
    .option("checkpointLocation", orders_checkpoint_path)
    .start()
)
This query is not appending the data. Every time, it inserts the new records and deletes the old records.
Can someone help me with this?
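For what it's worth, the posted writeStream chains two format() calls, and the second one wins, so the data goes to the in-memory sink rather than to the Delta table. Below is a sketch of a Delta-only append (toTable requires Spark 3.1+ / Databricks Runtime; the target table name "orders" is taken from the question):
# Sketch: keep a single sink format so the stream actually appends to the Delta table
orderOutputDF = (orderInputDF
    .writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", orders_checkpoint_path)
    .toTable("orders")   # or .start("/path/to/orders") using the table's storage location
)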

Multiple aggregations and Distinct Function in Spark Structured Streaming

I need to run some aggregations on streaming data from Kafka and output the top 10 rows of the result to the console every M seconds.
input_df = (
    spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", brokers)
    .option("subscribe", "page_views")
    .load()
    .selectExpr('cast(value as string)')
)
...
...
# info has 2 cols: domain, uid (info = transformation of input_df)
# It's an example of what I want to do (like in plain pyspark)
stat = (
    info
    .groupby('domain')
    .agg(
        F.count(F.col('UID')).alias('view'),
        F.countDistinct(F.col('UID')).alias('unique')
    )
    .sort(F.col("view").desc())
    .limit(10)
)

query = (
    stat
    .writeStream
    .outputMode("complete")
    .format("console")
    .option("truncate", "true")
    .start()
)
This example is without a time trigger, but I can add that myself.
Because countDistinct is not allowed on streaming data, I have no idea how to complete my exercise.
I tried making two dataframes, one for each aggregation (df_1 = (domain, view), df_2 = (domain, unique)), and then joining df_1 with df_2, but having several aggregations is also not allowed. So it's a dead end for me.
It would be great to have a solution for this.
Thanks for your attention!
You can do this with flatMapGroupsWithState, which is an arbitrary stateful function. Besides, it supports append mode and update mode.
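A different workaround (not the flatMapGroupsWithState approach above, just a sketch): approx_count_distinct is allowed in streaming aggregations, so an approximate unique count can be computed directly. The console sink's numRows option stands in for limit(10), and the 10-second trigger is an assumed value for "every M seconds":
from pyspark.sql import functions as F

# approx_count_distinct is supported in streaming aggregations (countDistinct is not);
# the trade-off is a small, configurable estimation error.
stat = (
    info
    .groupby('domain')
    .agg(
        F.count(F.col('UID')).alias('view'),
        F.approx_count_distinct(F.col('UID')).alias('unique')
    )
    .sort(F.col('view').desc())   # sorting is allowed after aggregation in complete mode
)

query = (
    stat.writeStream
    .outputMode("complete")
    .format("console")
    .option("numRows", 10)                    # show only the top 10 rows per trigger
    .option("truncate", "true")
    .trigger(processingTime="10 seconds")     # "every M seconds"; 10s is an assumed value
    .start()
)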

How to join two streaming datasets when one dataset involves aggregation

I am getting the below error in the below code snippet:
Exception in thread "main" org.apache.spark.sql.AnalysisException:
Append output mode not supported when there are streaming aggregations
on streaming DataFrames/DataSets without watermark;;
Below is my input schema
import org.apache.spark.sql.types._

val schema = new StructType()
  .add("product", StringType)
  .add("org", StringType)
  .add("quantity", IntegerType)
  .add("booked_at", TimestampType)
Creating streaming source dataset
import org.apache.spark.sql.functions.{col, from_json}

val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
Creating another streaming dataframe where the aggregation is done, and then joining it with the original source dataframe to filter out records:
payload_df.createOrReplaceTempView("orders")

val stage_df = spark.sql("select org, product, max(booked_at) as booked_at from orders group by 1,2")
stage_df.createOrReplaceTempView("stage")

val total_qty = spark.sql(
  "select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at")
Finally, I was trying to display the results on the console with Append output mode. I am not able to figure out where I need to add the watermark or how to resolve this. My objective is to filter out, in every trigger, only those events which have a higher timestamp than the maximum timestamp received in any of the earlier triggers.
total_qty
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()
With Spark Structured Streaming you can aggregate directly on a stream only with a watermark. If you have a column with the timestamp of the event, you can do it like this:
val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
  .withWatermark("event_time", "1 minutes")
On queries with aggregation you have 3 types of output modes:
Append mode uses the watermark to drop old aggregation state. But the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), because by the mode's semantics rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details.
Update mode uses the watermark to drop old aggregation state.
Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
Edit later:
You have to add a window to your groupBy method:
val aggFg = payload_df
  .groupBy(window($"event_time", "1 minute"), $"org", $"product")
  .agg(max($"booked_at").as("booked_at"))
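For reference, a rough PySpark equivalent of the same watermarked, windowed aggregation (a sketch, written in PySpark for consistency with the rest of the post; it uses booked_at from the schema above as the event-time column instead of the event_time placeholder):
from pyspark.sql import functions as F

agg_df = (
    payload_df
    .withWatermark("booked_at", "1 minute")
    .groupBy(F.window("booked_at", "1 minute"), "org", "product")
    .agg(F.max("booked_at").alias("booked_at"))
)

(agg_df.writeStream
    .format("console")
    .outputMode("append")   # append works once the aggregation is watermarked and windowed
    .start()
    .awaitTermination())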