pyspark - will the partitionBy option in Auto Loader -> writeStream partition existing table data?

I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) using the code below:
.writeStream\
.option("checkpointLocation", "path") \
.format("delta")\
.outputMode("append")\
.start("table")
Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):
.writeStream\
.option("checkpointLocation", "path") \
.partitionBy("col1")\
.format("delta")\
.outputMode("append")\
.start("table")
Will this partitionBy("col1") option partition the existing data in the table? If not, how can I partition all the data (both the existing data and newly ingested data)?

No, it won't partition existing data automatically; you will need to do it explicitly. Something like this (test it first on a small dataset):
Stop the stream if it's running continuously.
Read the existing data and overwrite it with the new partitioning scheme:
spark.read.table("table") \
.write.mode("overwrite")\
.partitionBy("col1")\
.option("overwriteSchema", "true") \
.saveAsTable("table")
Start the stream again with the partitionBy option, as sketched below.
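Putting the three steps together, a minimal sketch (assuming the running query handle is stream_query and df is the streaming DataFrame returned by your Auto Loader readStream; both names are placeholders, and the checkpoint and table paths are the ones from the question):
# 1. Stop the running stream
stream_query.stop()
# 2. Rewrite the existing table with the new partitioning (the overwrite shown above)
# 3. Restart the stream with partitionBy so newly ingested data is partitioned too
df.writeStream \
    .option("checkpointLocation", "path") \
    .partitionBy("col1") \
    .format("delta") \
    .outputMode("append") \
    .start("table")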

Related

Hudi data is overwritten on every new batch in Spark Structured Streaming

I am working on Spark Structured Streaming, where the job consumes Kafka messages, does an aggregation, and saves data to an Apache Hudi table every 10 seconds. The code below works fine, but it overwrites the resulting Apache Hudi table data on every batch. I have not yet figured out why this is happening. Is it Spark Structured Streaming or Hudi behavior? I am using MERGE_ON_READ, so the table files should not be deleted on every update. Because of this issue, my other job, which reads this table, fails.
spark.readStream
    .format('kafka')
    .option("kafka.bootstrap.servers", "localhost:9092")
    ...
    ...
df1 = df.groupby('a', 'b', 'c').agg(sum('d').alias('d'))
df1.writeStream
    .format('org.apache.hudi')
    .option('hoodie.table.name', 'table1')
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option('hoodie.datasource.write.keygenerator.class', 'org.apache.hudi.keygen.ComplexKeyGenerator')
    .option('hoodie.datasource.write.recordkey.field', "a,b,c")
    .option('hoodie.datasource.write.partitionpath.field', 'a')
    .option('hoodie.datasource.write.table.name', 'table1')
    .option('hoodie.datasource.write.operation', 'upsert')
    .option('hoodie.datasource.write.precombine.field', 'c')
    .outputMode('complete')
    .option('path', '/Users/lucy/hudi/table1')
    .option("checkpointLocation", "/Users/lucy/checkpoint/table1")
    .trigger(processingTime="10 second")
    .start()
    .awaitTermination()
Based on your configuration, the explanation for this problem may be that you read the same keys in each batch (the same a, b, c with different values of d), and since you have an upsert operation, Hudi replaces the old values with the new ones. Try using insert instead of upsert, or modify the Hudi record key, depending on what you want to do.
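If appending is what you want, a minimal hedged tweak to the snippet above is to change only the write operation; everything else can stay the same (whether insert or a different key is right depends on your use case):
# append new rows instead of replacing rows with the same (a, b, c) record key
.option('hoodie.datasource.write.operation', 'insert')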

Total records processed in each micro batch spark streaming

Is there a way I can find how many records got processed into the downstream Delta table for each micro-batch? I have a streaming job which runs once hourly using a trigger-once trigger with append mode. For audit purposes, I want to know how many records got processed in each micro-batch. I've tried the code below to print the count of records processed (shown in the second line).
ss_count = 0

def write_to_managed_table(micro_batch_df, batchId):
    # print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
    ss_count = micro_batch_df.count()

saveloc = "TABLE_PATH"

df_final.writeStream.trigger(once=True).foreachBatch(write_to_managed_table).option('checkpointLocation', f"{saveloc}/_checkpoint").start(saveloc)

print(ss_count)
The streaming job runs without any issues, but micro_batch_df.count() does not print any count.
Any pointers would be much appreciated.
Here is a working example of what you are looking for (structured_streaming_example.py):
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

# Create DataFrame representing the stream of input
df = spark.read.parquet("data/")
lines = spark.readStream.schema(df.schema).parquet("data/")

def batch_write(output_df, batch_id):
    print("inside foreachBatch for batch_id:{0}, rows in passed dataframe: {1}".format(batch_id, output_df.count()))

save_loc = "/tmp/example"
query = (lines.writeStream.trigger(once=True)
         .foreachBatch(batch_write)
         .option('checkpointLocation', save_loc + "/_checkpoint")
         .start(save_loc)
         )
query.awaitTermination()
Put a sample parquet file under the data/ folder for testing, then execute the code with spark-submit:
spark-submit --master local structured_streaming_example.py
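If you also want the counts available on the driver after the run, rather than only printed from inside foreachBatch, one hedged option (not part of the original answer, just the standard StreamingQuery progress API as I understand it) is to read them from the query's recent progress once it has finished:
# Each progress entry corresponds to one micro-batch and reports how many
# rows were read from the source for that batch.
for progress in query.recentProgress:
    print(progress["batchId"], progress["numInputRows"])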

Spark Dataset overwrite of a particular partition not working in Spark 2.4

In my job, the final step is to store the processed data in a Hive table partitioned on the "date" column.
Sometimes, due to a job failure, I need to re-run the job for a particular partition alone.
As observed, when I use the code below, Spark overwrites all the partitions when using overwrite mode.
ds.write.partitionBy("date").mode("overwrite").saveAsTable("test.someTable")
After going through multiple blogs and Stack Overflow posts, I followed the steps below to overwrite particular partitions only.
Step 1: Enable dynamic partition overwrite mode
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
Step 2: Write the DataFrame to the Hive table using saveAsTable
Seq(("Company1", "A"),
("Company2","B"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.partitionBy("id")
.saveAsTable(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
Seq(("CompanyA3", "A"))
.toDF("company", "id")
.write
.mode(SaveMode.Overwrite)
.insertInto(targetTable)
spark.sql(s"SELECT * FROM ${targetTable}").show(false)
spark.sql(s"show partitions ${targetTable}").show(false)
It still overwrites all the partitions.
As per this blog, https://www.waitingforcode.com/apache-spark-sql/apache-spark-sql-hive-insertinto-command/read, insertInto should overwrite only particular partitions.
If I create the table first and then use the insertInto method, it works fine:
Set the required configuration, then:
Step 1: Create the table
Step 2: Add data using the insertInto method
Step 3: Overwrite the partition
I wanted to know: what is the difference between creating the Hive table via saveAsTable and creating the table manually? Why does it not work in the first scenario?
Could anyone help me with this?
Try with lowercase w!
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
not
spark.conf.set("spark.sql.sources.partitionOverWriteMode", "dynamic")
It fooled me. If you look, you have two variations of it in use in your script.
It appears my original answer is now deprecated.
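For reference, a minimal PySpark sketch of the corrected approach (df and the table name are placeholders; the question's Scala code works the same way once the config key is spelled correctly):
# note the lowercase 'w' in partitionOverwriteMode
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
# with dynamic mode, overwrite replaces only the partitions present in df;
# the table must already exist, and insertInto matches columns by position
df.write.mode("overwrite").insertInto("test.someTable")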

Appending data to an automatically partitioned DataFrame stored externally as parquet files

Trying to write DataFrame data, auto-partitioned on an attribute, to an external store in append mode overwrites the parquet files.
I have a huge amount of data that I cannot load in one go, so I am reading the data one folder at a time in a loop. In every iteration, I partition the data based on a certain attribute and use saveAsTable to write the parquet files to Amazon S3. I am finding that my S3 folder is getting wiped out at every iteration. I want to append the data from every iteration to my Hive store in partitioned folders, so I can categorize the data and read only the category I want to work on.
This is the command I am using to save the dataframe.
DF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://xxx/<Path>')
for Pos1 in HexKey:
    folderKey = "{}".format(Pos1)
    spark = SparkSession.builder \
        .getOrCreate()
    if DataSetSchema is None:
        log.warn("Reviewing schema")
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location' + folderKey + '0/00/*')
        DataSetSchema = AllComponentsDF.schema
    else:
        log.warn("Reading folder {}".format(Pos1))
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location/' + folderKey + '0/00/*', schema=DataSetSchema)
    AllComponentsDF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://spark-cluster-boomi/AllComponents')

Databricks - failing to write from a DataFrame to a Delta location

I wanted to change a column name of a Databricks Delta table.
So I did the following:
// Read old table data
val old_data_DF = spark.read.format("delta")
.load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
.withColumnRenamed("column_a", "metric1")
.select("*")
// Dropped and recreated the Delta files location
dbutils.fs.rm("dbfs:/mnt/main/sales", true)
dbutils.fs.mkdirs("dbfs:/mnt/main/sales")
// Trying to write the new DF to the location
new_data_DF.write
.format("delta")
.partitionBy("sale_date_partition")
.save("dbfs:/mnt/main/sales")
Here I'm getting an error at the last step, when writing to Delta:
java.io.FileNotFoundException: dbfs:/mnt/main/sales/sale_date_partition=2019-04-29/part-00000-769.c000.snappy.parquet
A file referenced in the transaction log cannot be found. This occurs when data has been manually deleted from the file system rather than using the table `DELETE` statement
Obviously the data was deleted and most likely I've missed something in the above logic. Now the only place that contains the data is the new_data_DF.
Writing to a location like dbfs:/mnt/main/sales_tmp also fails.
What should I do to write data from new_data_DF to a Delta location?
In general, it is a good idea to avoid using rm on Delta tables. Delta's transaction log can prevent eventual consistency issues in most cases; however, when you delete and recreate a table in a very short time, different versions of the transaction log can flicker in and out of existence.
Instead, I'd recommend using the transactional primitives provided by Delta. For example, to overwrite the data in a table you can:
df.write.format("delta").mode("overwrite").save("/delta/events")
If you have a table that has already been corrupted, you can fix it using FSCK.
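As a hedged sketch of what that can look like from PySpark on Databricks, if I recall the FSCK syntax correctly (the path is the one from the question; running the DRY RUN variant first shows which missing files would be dropped from the log):
# preview the transaction-log entries pointing to files that no longer exist
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales` DRY RUN").show(truncate=False)
# remove the dangling entries from the transaction log
spark.sql("FSCK REPAIR TABLE delta.`dbfs:/mnt/main/sales`")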
You could do that in the following way.
// Read old table data
val old_data_DF = spark.read.format("delta")
.load("dbfs:/mnt/main/sales")
// Created a new DF with a renamed column
val new_data_DF = old_data_DF
.withColumnRenamed("column_a", "metric1")
.select("*")
// Trying to write the new DF to the location
new_data_DF.write
.format("delta")
.mode("overwrite") // this would overwrite the whole data files
.option("overwriteSchema", "true") //this is the key line.
.partitionBy("sale_date_partition")
.save("dbfs:/mnt/main/sales")
The overwriteSchema option will create new physical files with the latest schema that we updated during the transformation.