How to use maxOffsetsPerTrigger in pyspark structured streaming?

I want to limit the rate when fetching data from Kafka. My code looks like:
df = spark.read.format('kafka') \
    .option("kafka.bootstrap.servers", '...') \
    .option("subscribe", 'A') \
    .option("startingOffsets", '''{"A":{"0":200,"1":200,"2":200}}''') \
    .option("endingOffsets", '''{"A":{"0":400,"1":400,"2":400}}''') \
    .option("maxOffsetsPerTrigger", 20) \
    .load() \
    .cache()
However, when I call df.count(), the result is 600. What I expected is 20. Does anyone know why "maxOffsetsPerTrigger" doesn't work?

You are bringing in 200 records from each partition (0, 1, 2), so the total is 600 records.
As you can see here:
    Use the maxOffsetsPerTrigger option to limit the number of records to fetch per trigger.
This means that each trigger (each fetch) will get at most 20 records from Kafka, but in total you will still read all the records in the configured offset range (200 per partition).
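For reference, maxOffsetsPerTrigger only takes effect in a streaming query (spark.readStream); a batch spark.read between startingOffsets and endingOffsets ignores it. A minimal sketch of the streaming usage, assuming the same topic 'A', placeholder bootstrap servers, and a hypothetical checkpoint path:
# Sketch only: maxOffsetsPerTrigger caps how many offsets each micro-batch reads;
# the console sink and checkpoint path are placeholders.
stream_df = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "...")
    .option("subscribe", "A")
    .option("startingOffsets", '''{"A":{"0":200,"1":200,"2":200}}''')
    .option("maxOffsetsPerTrigger", 20)
    .load())

query = (stream_df.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/kafka_checkpoint")
    .start())
Each micro-batch of this query then reads at most 20 offsets in total, split proportionally across the three partitions.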

Related

pyspark - will the partitionBy option in autoloader -> writeStream partition existing table data?

I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) with the code below:
.writeStream \
    .option("checkpointLocation", "path") \
    .format("delta") \
    .outputMode("append") \
    .start("table")
Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):
.writeStream \
    .option("checkpointLocation", "path") \
    .partitionBy("col1") \
    .format("delta") \
    .outputMode("append") \
    .start("table")
Will this partitionBy("col1") option partition the data that already exists in the table? If not, how can I partition all the data (both the existing data and the newly ingested data)?
No, it won't partition existing data automatically; you will need to do it explicitly. Something like the steps below (test it first on a small dataset):
Stop the stream if it's running continuously.
Read the existing data and overwrite it with the new partitioning schema:
spark.read.table("table") \
    .write.mode("overwrite") \
    .partitionBy("col1") \
    .option("overwriteSchema", "true") \
    .saveAsTable("table")
Start the stream again.
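Put together, the whole sequence could look roughly like the sketch below. It assumes the running Auto Loader query handle is called query and that df is the readStream dataframe from the question; the table name and paths are the same placeholders used above.
# Sketch only, reusing the placeholder names from the question and answer above.
# 1. Stop the continuously running stream (assuming its handle is `query`)
query.stop()

# 2. Rewrite the existing table data with the new partitioning
spark.read.table("table") \
    .write.mode("overwrite") \
    .partitionBy("col1") \
    .option("overwriteSchema", "true") \
    .saveAsTable("table")

# 3. Restart the stream with partitionBy so newly ingested data lands partitioned
query = df.writeStream \
    .option("checkpointLocation", "path") \
    .partitionBy("col1") \
    .format("delta") \
    .outputMode("append") \
    .start("table")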

Total records processed in each micro-batch in Spark streaming

Is there a way I can find how many records got processed into the downstream delta table for each micro-batch? I have a streaming job that runs once every hour using trigger(once=True) with append mode. For audit purposes, I want to know how many records got processed in each micro-batch. I've tried the code below to print the count of processed records (shown in the second line).
ss_count = 0

def write_to_managed_table(micro_batch_df, batchId):
    # print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
    ss_count = micro_batch_df.count()

saveloc = "TABLE_PATH"
df_final.writeStream.trigger(once=True).foreachBatch(write_to_managed_table).option('checkpointLocation', f"{saveloc}/_checkpoint").start(saveloc)

print(ss_count)
The streaming job runs without any issues, but micro_batch_df.count() does not print any count.
Any pointers would be much appreciated.
Here is a working example of what you are looking for (structured_streaming_example.py):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

# Create a DataFrame representing the stream of input
df = spark.read.parquet("data/")
lines = spark.readStream.schema(df.schema).parquet("data/")

def batch_write(output_df, batch_id):
    print("inside foreachBatch for batch_id:{0}, rows in passed dataframe: {1}".format(batch_id, output_df.count()))

save_loc = "/tmp/example"
query = (lines.writeStream.trigger(once=True)
         .foreachBatch(batch_write)
         .option('checkpointLocation', save_loc + "/_checkpoint")
         .start(save_loc)
         )
query.awaitTermination()
Put a sample parquet file under the data folder and execute the code using spark-submit:
spark-submit --master local structured_streaming_example.py
Any sample parquet file under the data folder will work for testing.
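If you also need to keep those per-batch counts for auditing, rather than just printing them, one option is to record them from inside foreachBatch. A rough sketch, with a hypothetical audit location (/tmp/example_audit) and reusing the spark session from the example above:
from datetime import datetime
from pyspark.sql import Row

audit_path = "/tmp/example_audit"  # hypothetical location for the audit records

def batch_write_with_audit(output_df, batch_id):
    row_count = output_df.count()
    print("batch_id: {0}, rows: {1}".format(batch_id, row_count))
    # Append one audit row per micro-batch so the counts survive the job
    audit_df = spark.createDataFrame(
        [Row(batch_id=batch_id, row_count=row_count, processed_at=datetime.now())]
    )
    audit_df.write.mode("append").parquet(audit_path)
Passing batch_write_with_audit to foreachBatch instead of batch_write then leaves a small parquet audit trail with one row per micro-batch.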

PySpark structured streaming and filtered processing for parts

I want to evaluate a streamed (unbounded) data frame within Spark 2.4:
time          id  value
6:00:01.000   1   333
6:00:01.005   1   123
6:00:01.050   2   544
6:00:01.060   2   544
When all the data for id 1 has arrived in the dataframe and the data for the next id 2 starts coming in, I want to run the calculations over the complete data of id 1. But how do I do that? I don't think I can use window functions, since I do not know the time span in advance, and it also varies per id. I also do not know the ids from any source other than the streamed data frame.
The only solution that comes to my mind involves a variable comparison (a memory) and a while loop:
id_old = 0  # start value
while True:
    id_cur = id_from_dataframe
    if id_cur != id_old:  # id has changed
        do_calculation_for(id_cur)
        id_old = id_cur
But I do not think that this is the right solution. Can you give me a hint or point me to documentation that helps? I cannot find examples or documentation for this.
I got it running with a combination of watermarking and grouping:
import pyspark.sql.functions as F

d2 = d1.withWatermark("time", "60 second") \
    .groupby('id',
             F.window('time', "40 second")) \
    .agg(
        F.count("*").alias("count"),
        F.min("time").alias("time_start"),
        F.max("time").alias("time_stop"),
        F.round(F.avg("value"), 1).alias('value_avg'))
Most of the documentation only shows the basics, grouping by time alone; I saw just one example with an additional grouping column, so I put my 'id' there.
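To actually get those aggregates out, the grouped stream still needs a sink. A minimal sketch (console sink and a hypothetical checkpoint path), assuming d2 from above: with outputMode("append") plus the watermark, each (id, window) group is emitted only once the watermark has passed the end of its window, i.e. once Spark considers that slice of data complete.
# Sketch: emit each windowed aggregate once its watermark has passed.
# The console sink and checkpoint path are placeholders.
query = d2.writeStream \
    .outputMode("append") \
    .option("checkpointLocation", "/tmp/id_window_checkpoint") \
    .format("console") \
    .start()

query.awaitTermination()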

Spark Scala performance issues with take and InsertInto commands

Please look at the attached screenshot.
I am trying to make some performance improvements to my Spark job, and it takes almost 5 minutes to execute the take action on the dataframe.
I am using take to make sure the dataframe has some records in it; if it does, I want to proceed with further processing.
I tried take and count and don't see much difference in execution time.
Another scenario is that it takes around 10 minutes to write the dataframe into a Hive table (it has at most 200 rows and 10 columns):
df.write.mode("append").partitionBy("date").insertInto(tablename)
Please suggest how we can minimize the time taken by take and by the insert into the Hive table.
Updates:
Here is my spark submit : spark-submit --master yarn-cluster --class com.xxxx.info.InfoAssets --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Djava.security.auth.login.config=kafka_spark_jaas.conf" --files /home/ngap.app.rcrp/hive-site.xml,/home//kafka_spark_jaas.conf,/etc/security/keytabs/ngap.sa.rcrp.keytab --jars /usr/hdp/current/spark-client/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/current/spark-client/lib/datanucleus-core-3.2.10.jar,/usr/hdp/current/spark-client/lib/datanucleus-rdbms-3.2.9.jar --executor-memory 3G --num-executors 3 --executor-cores 10 /home/InfoAssets/InfoAssets.jar
Code details:
It's a simple dataframe which has 8 columns and around 200 records, and I am using the following code to insert it into the Hive table:
df.write.mode("append").partitionBy("partkey").insertInto(hiveDB + "." + tableName)
Thanks, Bab
Don't use count before the write if it is not necessary, and if your table is already created, then use Spark SQL to insert the data into the Hive partitioned table:
spark.sql("Insert into <tgt tbl> partition(<col name>) select cols,partition col from temp_tbl")

Spark Dataframe returns an inconsistent value on count()

I am using pyspark to perform some computations on data obtained from a PostgreSQL database. My pipeline looks something like this:
limit = 1000
query = "(SELECT * FROM table LIMIT {}) as filter_query"

df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.createOrReplaceTempView("table")
df.count()  # 1000
So far, so good. The problem starts when I perform some transformations on the data:
counted_data = spark.sql("SELECT column1, count(*) as count FROM table GROUP BY column1").orderBy("column1")
counted_data.count()  # First value

counted_data_with_additional_column = counted_data.withColumn("column1", my_udf_function)
counted_data_with_additional_column.count()  # Second value, inconsistent with the first count (they should be the same)
The first transformation alters the number of rows (the value should be <= 1000), but the second one does not; it just adds a new column. How can I be getting a different result from count()?
The explanation is actually quite simple, but a bit tricky. Spark may perform additional reads from the input source (in this case the database). Since another process was inserting data into the database, these additional reads returned slightly different data than the original read, causing the inconsistent behaviour. A simple call to df.cache() after the read prevents those further reads from hitting the database. I figured this out by analyzing the traffic between the database and my computer: indeed, some further SQL commands were issued that matched my transformations. After adding the cache() call, no further traffic appeared.
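In code, that just means caching the dataframe right after loading it, and optionally forcing a count so the rows are materialized before anything else changes the underlying table; subsequent actions then reuse the cached rows instead of re-querying PostgreSQL:
# Cache right after the JDBC read so later counts reuse the same snapshot of rows
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load() \
    .cache()

df.count()  # materializes the cache, so later counts see the same rows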
Since you are using LIMIT 1000, you might be getting a different set of 1000 records on each execution. And since you get different records each time, the result of the aggregation will be different. To get consistent behaviour with LIMIT, you can try the following approaches.
Either cache your dataframe with cache() or persist(), which ensures that Spark uses the same data for as long as it remains available in memory.
But the better approach is to sort the data on some unique column and then take the 1000 records, which ensures that you get the same 1000 records each time.
Hope it helps.
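A sketch of that second approach: push the ordering into the JDBC subquery so the LIMIT is deterministic. Here "id" is assumed to be a unique column in the table; adjust it to your schema.
# Order by a unique column before limiting, so repeated reads return the same rows
limit = 1000
query = "(SELECT * FROM table ORDER BY id LIMIT {}) as filter_query"
df = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://path/to/db") \
    .option("dbtable", query.format(limit)) \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()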