Appending data to an automatically partitioned dataframe stored externally as parquet files - pyspark

Writing a Dataframe that is automatically partitioned on an attribute to an external store in append mode overwrites the existing parquet files.
I have a huge amount of data that I cannot load in one go, so I am reading it one folder at a time in a loop. In every iteration, I partition the data on a certain attribute and use saveAsTable to write the parquet files to Amazon S3. I am finding that my S3 folder gets wiped out at every iteration. I want to add the data from every iteration to my Hive store in partitioned folders, so that I can categorize the data and read only the category I want to work on.
This is the command I am using to save the dataframe:
DF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://xxx/<Path>')
for Pos1 in HexKey:
    folderKey = "{}".format(Pos1)
    spark = SparkSession.builder \
        .getOrCreate()
    if DataSetSchema is None:
        log.warn("Reviewing schema")
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location' + folderKey + '0/00/*')
        DataSetSchema = AllComponentsDF.schema
    else:
        log.warn("Reading folder {}".format(Pos1))
        AllComponentsDF = spark.read \
            .format('com.databricks.spark.xml') \
            .load('s3a://location/' + folderKey + '0/00/*', schema=DataSetSchema)
    AllComponentsDF.write.partitionBy('Type').format('parquet').mode("append").saveAsTable('AllComponents', path='s3a://spark-cluster-boomi/AllComponents')

Related

pyspark - will the partitionBy option in autoloader -> writeStream partition existing table data?

I used Auto Loader to read data files and write them to a table periodically (without partitioning at first) with the code below:
.writeStream\
.option("checkpointLocation", "path") \
.format("delta")\
.outputMode("append")\
.start("table")
Now the data size is growing, and I want to partition the data by adding the option .partitionBy("col1"):
.writeStream\
.option("checkpointLocation", "path") \
.partitionBy("col1")\
.format("delta")\
.outputMode("append")\
.start("table")
I want to ask whether the partitionBy("col1") option will partition the existing data in the table. If not, how can I partition all the data (both the existing data and newly ingested data)?
No, it won't partition existing data automatically; you will need to do it explicitly. Something like this (test first on a small dataset):
1. Stop the stream if it is running continuously.
2. Read the existing data and overwrite it with the new partitioning scheme:
spark.read.table("table") \
.write.mode("overwrite")\
.partitionBy("col1")\
.option("overwriteSchema", "true") \
.saveAsTable("table")
3. Start the stream again.

Total records processed in each micro batch spark streaming

Is there a way to find out how many records were processed into the downstream Delta table for each micro-batch? I have a streaming job that runs once an hour using trigger.once() in append mode. For audit purposes, I want to know how many records were processed in each micro-batch. I tried the code below to print the count of records processed (see the commented-out print statement).
ss_count=0
def write_to_managed_table(micro_batch_df, batchId):
    #print(f"inside foreachBatch for batch_id:{batchId}, rows in passed dataframe: {micro_batch_df.count()}")
    ss_count = micro_batch_df.count()
saveloc = "TABLE_PATH"
df_final.writeStream.trigger(once=True).foreachBatch(write_to_managed_table).option('checkpointLocation', f"{saveloc}/_checkpoint").start(saveloc)
print(ss_count)
The streaming job runs without any issues, but the count from micro_batch_df.count() is never printed.
Any pointers would be much appreciated.
Here is a working example of what you are looking for (structured_steaming_example.py):
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()
# Create DataFrame representing the stream of input
df = spark.read.parquet("data/")
lines = spark.readStream.schema(df.schema).parquet("data/")
def batch_write(output_df, batch_id):
    print("inside foreachBatch for batch_id:{0}, rows in passed dataframe: {1}".format(batch_id, output_df.count()))
save_loc = "/tmp/example"
query = (lines.writeStream.trigger(once=True)
    .foreachBatch(batch_write)
    .option('checkpointLocation', save_loc + "/_checkpoint")
    .start(save_loc)
)
query.awaitTermination()
The sample parquet file is attached. Put it in the data folder and run the code using spark-submit:
spark-submit --master local structured_steaming_example.py
Any sample parquet file placed in the data folder will work for testing.
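A likely reason the `print(ss_count)` in the question never shows the batch count can be sketched without Spark. Inside `write_to_managed_table`, the assignment `ss_count = ...` creates a new local variable, so the module-level `ss_count` stays at 0. Separately, `start()` returns before the batch has run, so any count is only meaningful after `awaitTermination()`. In the sketch below, `run_batches` and the list-based batches are hypothetical stand-ins for the engine calling the `foreachBatch` function, and `len(batch)` stands in for `micro_batch_df.count()`.

```python
# Plain-Python sketch of the scoping issue; no Spark required.
# run_batches is a hypothetical stand-in for the streaming engine, which
# calls the foreachBatch function once per micro-batch.

ss_count = 0  # module-level counter, as in the question


def broken_callback(batch, batch_id):
    # Assignment creates a *local* ss_count; the module-level one is untouched.
    ss_count = len(batch)


batch_counts = {}  # batch_id -> number of rows seen in that batch


def recording_callback(batch, batch_id):
    # Mutating a shared dict (no rebinding) makes the count visible outside.
    batch_counts[batch_id] = len(batch)


def run_batches(batches, callback):
    # Invoke the callback once per batch, passing the batch and its id.
    for batch_id, batch in enumerate(batches):
        callback(batch, batch_id)


batches = [["a", "b", "c"], ["d", "e"]]

run_batches(batches, broken_callback)
print(ss_count)       # still 0

run_batches(batches, recording_callback)
print(batch_counts)   # {0: 3, 1: 2}
```

With Spark, the same idea would apply as long as the callback runs in the driver process: record `micro_batch_df.count()` keyed by `batchId`, and read the dict only after `awaitTermination()` returns.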

Spark sql Optimization Techniques loading csv to orc format of hive

Hi, I have 90 GB of data in a CSV file. I load this data into a temp table and then from the temp table into an ORC table using an insert-select command, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I am not using any optimization technique; I am just using Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert-select).
I run it with spark-submit as:
spark-submit \
--class class-name\
--jar file
Or can I add any extra parameter to spark-submit to improve performance?
Scala code (sample):
// all imports
object demo {
  def main(args: Array[String]) {
    // SparkSession with Hive support enabled
    var a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
    var b1 = sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
You wrote: "I'm just using spark sql and loading data from csv file to table(textformat) and then from this temp table to orc table(using select insert)"
A two-step process is not needed here.
Read the CSV into a dataframe directly, as in the sample below:
val DFCsv = spark.read.format("csv")
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("yourcsv")
If needed, repartition the data; since it is a large file, the missing repartition may be the cause of the 4-hour delay.
DFCsv.repartition(90) redistributes the CSV data into 90 roughly equal parts (it returns a new dataframe, so write from the result). Here 90 is only a sample number; you can choose whatever value suits your data. Then write:
DFCsv.write.format("orc")
  .partitionBy("yourpartitioncolumns")
  .saveAsTable("yourtable")
OR
// insertInto cannot be combined with partitionBy; the existing table's partitioning is used
DFCsv.write.format("orc")
  .insertInto("yourtable")
Note: 1) For large data, repartitioning distributes the data uniformly, which increases parallelism and hence performance.
2) If the table is not partitioned (you have no partition columns), there is no need for partitionBy in the samples above.
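The `90` passed to `repartition` is only a sample number. One way to pick a starting value is to divide the input size by a target partition size. The sketch below assumes a target of about 128 MB per partition, which is a common rule of thumb and not something stated in the answer; the helper name `suggested_partitions` is hypothetical.

```python
import math


def suggested_partitions(total_bytes, target_partition_bytes=128 * 1024**2):
    """Number of roughly equal partitions needed to stay near the target size."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


csv_size = 90 * 1024**3  # the 90 GB input mentioned in the question
print(suggested_partitions(csv_size))  # 720
```

The result is only a starting point for `repartition`; it would normally be adjusted after looking at task durations and the number of available cores.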

Spark-Optimization Techniques

Hi, I have 90 GB of data in a CSV file. I load this data into a temp table and then from the temp table into an ORC table using an insert-select command, but converting and loading the data into ORC format takes 4 hours in Spark SQL. Is there any optimization technique I can use to reduce this time? As of now I am not using any optimization technique; I am just using Spark SQL to load the data from the CSV file into a table (text format) and then from this temp table into the ORC table (using insert-select).
I run it with spark-submit as:
spark-submit \
--class class-name\
--jar file
Or can I add any extra parameter to spark-submit to improve performance?
Scala code (sample):
// all imports
object sample_1 {
  def main(args: Array[String]) {
    // SparkSession with Hive support enabled
    var a1 = sparksession.sql("load data inpath 'filepath' overwrite into table table_name")
    var b1 = sparksession.sql("insert into tablename (all_column) select 'ALL_COLUMNS' from source_table")
  }
}
First of all, you don't need to store the data in the temp table to write into hive table later. You can straightaway read the file and write the output using the DataFrameWriter API. This will reduce one step from your code.
You can write as follows:
val spark = SparkSession.builder.enableHiveSupport().getOrCreate()
val inputDF = spark.read.csv(filePath) // add header or delimiter options if needed
inputDF.write.mode("append").format(outputFormat).saveAsTable(outputDB + "." + outputTableName)
Here, the outputFormat will be orc, the outputDB will be your hive database and outputTableName will be your Hive table name.
I think this will reduce your write time significantly. Also, please mention the resources your job is using and I may be able to optimize it further.
Another optimization you can use is to partition your dataframe while writing, which can make the write operation faster. However, you need to choose the partition columns carefully so that you don't end up creating a very large number of partitions.
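To make the warning about too many partitions concrete: each distinct combination of values in the partition columns becomes its own directory. The sketch below counts those combinations on a handful of rows; the column names and sample rows are hypothetical, and in practice the count would come from a sample of the real data.

```python
def count_partition_dirs(rows, partition_cols):
    """Count distinct value combinations, one per leaf partition directory."""
    return len({tuple(row[c] for c in partition_cols) for row in rows})


# Hypothetical sample rows; in practice these would come from the CSV.
rows = [
    {"country": "US", "day": "2020-01-01", "user_id": 1},
    {"country": "US", "day": "2020-01-02", "user_id": 2},
    {"country": "DE", "day": "2020-01-01", "user_id": 3},
    {"country": "DE", "day": "2020-01-01", "user_id": 4},
]

print(count_partition_dirs(rows, ["country"]))         # 2
print(count_partition_dirs(rows, ["country", "day"]))  # 3
print(count_partition_dirs(rows, ["user_id"]))         # 4
```

A column such as the hypothetical `user_id`, which is nearly unique per row, yields about one directory per row and is usually a poor partition column.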

Group Cassandra Rows Then Write As Parquet File Using Spark

I need to write Cassandra partitions as parquet files. Since I cannot share and use sparkSession inside a foreach function, I first call collect to bring all the data to the driver program and then write the parquet files to HDFS, as below.
Thanks to this link https://github.com/datastax/spark-cassandra-connector/blob/master/doc/16_partitioning.md
I am able to get my partitioned rows. I want to write the partitioned rows into separate parquet files whenever a partition is read from the Cassandra table. I also tried sparkSQLContext, but that method writes the task results as temporary files; I think I will only see the parquet files after all the tasks are done.
Is there a convenient way to do this?
val keyedTable : CassandraTableScanRDD[(Tuple2[Int, Date], MyCassandraTable)] = getTableAsKeyed()
keyedTable.groupByKey
  .collect
  .foreach(f => {
    import sparkSession.implicits._
    val items = f._2.toList
    val key = f._1
    val baseHDFS = "hdfs://mycluster/parquet_test/"
    val ds = sparkSession.sqlContext.createDataset(items)
    ds.write
      .option("compression", "gzip")
      .parquet(baseHDFS + key._1 + "/" + key._2)
  })
Why not use Spark SQL everywhere and use the built-in Parquet functionality to write data by partitions, instead of creating a directory hierarchy yourself?
Something like this:
import org.apache.spark.sql.cassandra._
val data = spark.read.cassandraFormat("table", "keyspace").load()
data.write
  .option("compression", "gzip")
  .partitionBy("col1", "col2")
  .parquet(baseHDFS)
In this case, it will create a separate directory for every value of col1 and col2, as nested directories with names like ${column}=${value}. Then, when you read, you can restrict the read to a specific value.
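The `${column}=${value}` layout can be illustrated with a small plain-Python sketch. It only mimics the naming convention; it is not how Spark builds the paths internally, and it ignores the escaping of special characters. The helper names `partition_path` and `parse_partition_path` and the sample row are hypothetical.

```python
def partition_path(base, row, partition_cols):
    """Build the nested column=value directory for one row."""
    parts = ["{}={}".format(col, row[col]) for col in partition_cols]
    return "/".join([base.rstrip("/")] + parts)


def parse_partition_path(path, base):
    """Recover the partition values (as strings) encoded in a directory path."""
    relative = path[len(base.rstrip("/")) + 1:]
    return dict(segment.split("=", 1) for segment in relative.split("/"))


base = "hdfs://mycluster/parquet_test/"
row = {"col1": 42, "col2": "2020-01-01", "payload": "x"}

path = partition_path(base, row, ["col1", "col2"])
print(path)  # hdfs://mycluster/parquet_test/col1=42/col2=2020-01-01
print(parse_partition_path(path, base))  # {'col1': '42', 'col2': '2020-01-01'}
```

Because the partition values are encoded in the directory names, a reader that filters on col1 or col2 only needs to open the matching directories.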