Databricks: writeStream not processing data - pyspark

I am working on a Databricks training and having a hard time getting a writeStream query to work. Maybe I am not grasping the whole concept of streaming correctly. I have a path with 20 JSON files, which I am able to read by doing:
ordersJsonPath = "dbfs:/user/dbacademy/developer-foundations-capstone/raw/orders/stream/*"
ordersDF = (spark.readStream
    .option("maxFilesPerTrigger", 1)
    .schema(userDefinedSchema)
    .json(ordersJsonPath))
When running "display(ordersDF)", I can see that the 20 JSON get added sequentially to the dataframe with the correct schema. However, when I want to write the files in a table with the same schema, nothing gets processed. My code for the streamWrite is:
checkpointPath = "dbfs:/user/dbacademy/developer-foundations-capstone/checkpoint/orders"
orders = (ordersDF.writeStream
    .format("delta")
    .queryName("ordersQuery")
    .outputMode("append")
    .trigger(processingTime="1 second")
    .option("checkpointLocation", checkpointPath)
    .table("orders"))
The writeStream query runs but does not show any result (and the table does not get updated). Since I am not getting any error message, it is hard to say what's wrong, but it appears there is just no real connection between the read and the write query. Do I need to run both queries at the same time or sequentially? Or am I confusing things here, and do I need two different dataframes for read and write?
Thanks a lot!

The solution was to delete the checkpointPath directory and run the query again.
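For reference, a minimal sketch of clearing the checkpoint and restarting the stream, assuming the same notebook variables (ordersDF, checkpointPath) and that Databricks' dbutils is available:

# Remove the stale checkpoint so the stream starts from a clean state
# (True enables a recursive delete of the checkpoint directory).
dbutils.fs.rm(checkpointPath, True)

# Restart the streaming write against the same streaming dataframe.
orders = (ordersDF.writeStream
    .format("delta")
    .queryName("ordersQuery")
    .outputMode("append")
    .trigger(processingTime="1 second")
    .option("checkpointLocation", checkpointPath)
    .table("orders"))

# orders.status and orders.lastProgress can then be checked to confirm
# that micro-batches are actually being processed.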

Related

Pyspark updating a particular partition of an external Hive table

I am trying to overwrite a particular partition of a Hive table using pyspark, but each time I try to do that, all the other partitions get wiped off. I went through a couple of posts here regarding this and implemented the steps, but it seems I am still getting an error. The code I am using is:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.conf.set("hive.exec.dynamic.partition", "true")
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
df.write.format('parquet').mode('overwrite').partitionBy('col1').option("partitionOverwriteMode", "dynamic").saveAsTable(op_dbname+'.'+op_tblname)
Initially the partitions are like col1=m and col1=n, and while I am trying to overwrite only the partition col1=m, it is wiping out col1=n as well.
Spark version is 2.4.4
Appreciate any help.
After multiple rounds of trial and error, this is the method that I tried:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
op_df.write.format(op_fl_frmt).mode(operation_mode).option("partitionOverwriteMode", "dynamic").insertInto(op_dbname+'.'+op_tblname,overwrite=True)
When I tried to use saveAsTable, no matter what I did it always wiped off all the values, and setting only the flag 'spark.sql.sources.partitionOverwriteMode' to dynamic did not seem to work. Hence I used insertInto with an overwrite flag inside it to achieve the desired output.
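For illustration, a minimal PySpark sketch of that pattern (the table and column names are hypothetical; note that insertInto resolves columns by position, not by name, so the dataframe's column order must match the target table, with the partition column last):

# With dynamic mode, only the partitions present in op_df are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(op_df
    .select("value_col", "col1")   # hypothetical columns, partition column last
    .write
    .mode("overwrite")
    .insertInto("mydb.mytable", overwrite=True))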

Continuously Updating Partitioned Parquet

I have a Spark script that pulls data from a database and writes it to S3 in parquet format. The parquet data is partitioned by date. Because of the size of the table, I'd like to run the script daily and have it just rewrite the most recent few days of data (redundancy because data may change for a couple days).
I'm wondering how I can go about writing the data to S3 in a way that only overwrites the partitions for the days I'm working with. SaveMode.Overwrite unfortunately wipes out everything already under the target path, and the other save modes don't seem to be what I'm looking for.
Snippet of my current write:
table
.filter(row => row.ts.after(twoDaysAgo)) // update most recent 2 days
.withColumn("date", to_date(col("ts"))) // add a column with just date
.write
.mode(SaveMode.Overwrite)
.partitionBy("date") // use the new date column to partition the parquet output
.parquet("s3a://some-bucket/stuff") // pick a parent directory to hold the parquets
Any advice would be much appreciated, thanks!
The answer I was looking for was Dynamic Overwrite, detailed in this article. Short answer: adding this line fixed my problem:
sparkConf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")

How can I see the multiple queries that get generated to read data for each partition in parallel from a database using Spark?

I am trying to read data from a Postgres table using Spark. Initially I was reading the data on a single thread, without using lowerBound, upperBound, partitionColumn and numPartitions. The data that I'm reading is huge, around 120 million records, so I decided to read the data in parallel using partitionColumn. I am able to read the data, but it is taking more time to read it with 12 parallel threads than with a single thread. I am unable to figure out how I can see the 12 SQL queries that get generated to fetch the data in parallel for each partition.
The code that I am using is:
val query = s"(select * from db.testtable) as testquery"
val df = spark.read
.format("jdbc")
.option("url", jdbcurl)
.option("dbtable", query)
.option("partitionColumn","transactionbegin")
.option("numPartitions",12)
.option("driver", "org.postgresql.Driver")
.option("fetchsize", 50000)
.option("user","user")
.option("password", "password")
.option("lowerBound","2019-01-01 00:00:00")
.option("upperBound","2019-12-31 23:59:00")
.load
df.count()
Where and how can I see the 12 parallel queries that are getting created to read the data in parallel on each thread?
I am able to see that 12 tasks are created in the Spark UI, but I cannot find a way to see which 12 separate queries are generated to fetch the data in parallel from the Postgres table.
Is there any way I can push the filter down so that it reads only this year's worth of data, in this case 2019?
The SQL statement is printed using the "info" log level, see here. You need to change Spark's log level to "info" to see the SQL. Additionally, the WHERE condition alone is printed too, as here.
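For example, a minimal way to raise the log level from a PySpark session (assuming the default level is WARN or higher):

# Raise the driver's log level so the per-partition JDBC queries
# (and their WHERE clauses on the partition column) show up in the logs.
spark.sparkContext.setLogLevel("INFO")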
You can also view the SQL in your PostgreSQL database using the pg_stat_statements view, which requires a separate extension to be installed. There is also a way to log the SQL statements and see them, as mentioned here.
I suspect the parallelism is slow for you because there is no index on the "transactionbegin" column of your table. The partitionColumn must be indexed, otherwise every one of those parallel sessions will scan the entire table, which will choke the database.
It's not exactly the multiple queries, but it will show the execution plan that Spark has optimized based on your query. It may not be complete, depending on the stages you have to execute.
You can build your DAG as a DataFrame and, before actually calling an action, use the explain() method on it. Reading the output can be challenging because it is upside down: the source is at the bottom. It may seem a little unusual at first, so start with basic transformations and go step by step if you are reading a plan for the first time.
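A minimal sketch of inspecting the plan before triggering the action; the connection options mirror the ones in the question and are placeholders:

df = (spark.read
    .format("jdbc")
    .option("url", jdbcurl)                # placeholder connection details
    .option("driver", "org.postgresql.Driver")
    .option("user", "user")
    .option("password", "password")
    .option("dbtable", "(select * from db.testtable) as testquery")
    .option("partitionColumn", "transactionbegin")
    .option("numPartitions", 12)
    .option("lowerBound", "2019-01-01 00:00:00")
    .option("upperBound", "2019-12-31 23:59:00")
    .load())

# Print the plans; the JDBC scan (with any pushed-down filters) sits at
# the bottom of the physical plan.
df.explain(True)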

Results of rdd.count and count via Spark SQL are the same, but they are different from the count with Hive SQL

I use count to calculate the number of rows in the RDD and got 13673153, but after I convert the RDD to a DataFrame, insert it into Hive, and count again, I get 13673182. Why?
rdd.count
spark.sql("select count(*) from ...").show()
hive sql: select count(*) from ...
This could be caused by a mismatch between the data in the underlying files and the metadata registered in Hive for that table. Try running:
MSCK REPAIR TABLE tablename;
in Hive and see if the issue is fixed. The command updates the partition information of the table. You can find more info in the documentation here.
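The same repair can also be issued from Spark itself; a minimal sketch with a placeholder table name:

# Register any partitions that exist on disk but are missing from the
# metastore, then re-run the count to compare with the Hive-side result.
spark.sql("MSCK REPAIR TABLE mydb.mytable")
spark.sql("SELECT COUNT(*) FROM mydb.mytable").show()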
During a Spark Action, as part of the SparkContext, Spark records which files were in scope for processing. So, if the DAG needs to recover and reprocess that Action, the same results are obtained. By design.
Hive QL has no such considerations.
UPDATE
As you noted, the other answer did not help in this use case.
So, when Spark processes Hive tables it looks at the list of files that it will use for the Action.
In the case of a failure (node failure, etc.) it will recompute data from the generated DAG. If it needs to go back and recompute as far as the start of reading from Hive itself, then it will know which files to use - i.e. the same files - so that the same results are obtained instead of non-deterministic outcomes. E.g. think of partitioning aspects; it is handy that the same results can be recomputed!
It's that simple. It's by design. Hope this helps.

Overwriting the parquet file throws exception in spark

I am trying to read a parquet file from an HDFS location, do some transformations, and overwrite the file in the same location. I have to overwrite the file in the same location because I have to run the same code multiple times.
Here is the code I have written
val df = spark.read.option("header", "true").option("inferSchema", "true").parquet("hdfs://master:8020/persist/local/")
//after applying some transformations lets say the final dataframe is transDF which I want to overwrite at the same location.
transDF.write.mode("overwrite").parquet("hdfs://master:8020/persist/local/")
Now the problem is that, before reading the parquet file from the given location, Spark (I believe) deletes the files at that location because of overwrite mode. So when executing the code I get the following error:
File does not exist: hdfs://master:8020/persist/local/part-00000-e73c4dfd-d008-4007-8274-d445bdea3fc8-c000.snappy.parquet
Any suggestions on how to solve this problem? Thanks.
The simple answer is that you cannot overwrite what you are reading. The reason behind this is that overwrite needs to delete everything; however, since Spark works in parallel, some portions might still be being read at that time. Furthermore, even if everything had been read, Spark would still need the original files to recalculate tasks that fail.
Since you need the input for multiple iterations, I would simply make the input and output paths arguments of the function that does one iteration, and delete the previous iteration's output only once the write is successful.
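A minimal sketch of that pattern, with hypothetical paths and a transform() placeholder standing in for the actual transformations:

def run_iteration(spark, input_path, output_path):
    """Read from input_path, transform, and write the result to output_path."""
    df = spark.read.parquet(input_path)
    trans_df = transform(df)   # placeholder for the real transformations
    trans_df.write.mode("overwrite").parquet(output_path)

# Alternate between two locations so the job never overwrites what it is reading.
path_a = "hdfs://master:8020/persist/local_a/"   # hypothetical paths
path_b = "hdfs://master:8020/persist/local_b/"

run_iteration(spark, path_a, path_b)
# Only after the write to path_b succeeds is it safe to delete or reuse path_a,
# e.g. by swapping the roles of the two paths on the next run.
run_iteration(spark, path_b, path_a)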
This is what I tried, and it worked. My requirement was almost the same; it was an upsert scenario.
By the way, the spark.conf.set('spark.sql.sources.partitionOverwriteMode', 'dynamic') property was set. Even then, the transform job was failing.
Took a backup of the S3 folder (the final curated layer) before every batch operation.
Using dataframe operations, first deleted the S3 parquet file location before the overwrite.
Then appended to that particular location (see the sketch after this answer).
Previously the entire job ran for 1.5 hours and failed frequently; now it takes 10-15 minutes for the entire operation.
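For illustration, a minimal sketch of the backup / delete / append sequence described above; the bucket names, paths, upserted_df, and the Hadoop FileSystem-based delete are assumptions rather than the answerer's exact code:

target_path = "s3a://curated-bucket/orders/"       # hypothetical final curated layer
backup_path = "s3a://curated-bucket/orders_bak/"   # hypothetical backup location

# 1. Back up the current curated data before the batch operation.
spark.read.parquet(target_path).write.mode("overwrite").parquet(backup_path)

# 2. Delete the existing parquet location via the Hadoop FileSystem API.
hadoop_path = spark._jvm.org.apache.hadoop.fs.Path(target_path)
fs = hadoop_path.getFileSystem(spark._jsc.hadoopConfiguration())
fs.delete(hadoop_path, True)   # recursive delete

# 3. Append the new batch (upserted_df is the dataframe produced by the
#    upsert logic) to the now-empty location.
upserted_df.write.mode("append").parquet(target_path)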