Write Spark Dataset to Excel File along with partitioning - scala

I have a Dataset similar to the below structure:
col_A col_B date
1 5 2021-04-14
2 7 2021-04-14
3 5 2021-04-14
4 9 2021-04-14
I am trying to use the below code in Spark Java to write the dataset to a file in HDFS.
Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
some code which sets the outputDataset
*/
outputDataset
    .repartition(1)
    .write()
    .partitionBy("date")
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .save("/saveLoc/sales");
Normal Working Case:
When I use .format("csv"), the above code creates a folder named date=2021-04-14 inside the path /saveLoc/sales passed to .save(), which is exactly as expected. The full path of the resulting file is /saveLoc/sales/date=2021-04-14/someFileName.csv. Also, the column date is removed from the file since it was partitioned on.
What I need to do:
However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc and doesn't remove the partition column (date) from the resulting file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Note that it overwrites the folder "sales" with a file named sales.
The Excel plugin used is described here: https://github.com/crealytics/spark-excel
How can I make it partition when writing to Excel? In other words, how can I make it behave exactly as it does for csv?
Versions used:
spark-excel: com.crealytics.spark-excel_2.11
scala: org.apache.spark.spark-core_2.11
Thanks.
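For reference, a minimal workaround sketch (assuming the Excel data source ignores partitionBy, which matches the behavior described above): write one Excel output per distinct date yourself, using the same paths and column names as the question.
import org.apache.spark.sql.functions.col

// Sketch: emulate partitionBy("date") by writing one Excel output per distinct date.
// This assumes the Excel writer ignores partitionBy; verify against your spark-excel version.
val dates = outputDataset.select("date").distinct().collect().map(_.get(0).toString)

dates.foreach { d =>
  outputDataset
    .filter(col("date") === d)
    .drop("date")                         // mimic partitionBy, which drops the partition column
    .repartition(1)
    .write
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .mode("overwrite")
    .save(s"/saveLoc/sales/date=$d")
}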

Related

Write parquet with partitionby vs. just a loop [duplicate]

This question already has answers here:
Overwrite only some partitions in a partitioned spark Dataset
(3 answers)
Closed 2 years ago.
Let's say I have a script which writes a parquet file every week into a FOLDER, partitioned by DAY and COUNTRY.
SOLUTION 1:
df.write.parquet(FOLDER, mode='overwrite',
                 partitionBy=['DAY', 'COUNTRY'])
The problem with this is that if you later want to rerun the script just for a specific country and date, because the data in that partition was corrupted, it will delete the whole folder's contents and write data only for that specific day/country.
APPEND doesn't solve it either; it would just append the correct data next to the wrong data.
What would be ideal is if the above command ONLY overwrote the DAY/COUNTRY combinations present in the df.
SOLUTION 2:
Make a loop:
for country in countries:
    for day in days:
        df.write.parquet(FOLDER/day/country, mode='overwrite')
This works: if I rerun the script, it only overwrites the files under the specific FOLDER/day/country. It just feels so wrong. Is there a better alternative?
If you are using Spark 2.3 or above, you can create a partitioned table and set the spark.sql.sources.partitionOverwriteMode setting to dynamic:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df.write.mode("overwrite").insertInto("yourtable")
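For context, a minimal end-to-end sketch of that approach (the table name sales_parquet and the DataFrame fixedDf are placeholders, not from the original answer):
// One-time setup: create a partitioned table from the full dataset
df.write
  .partitionBy("DAY", "COUNTRY")
  .format("parquet")
  .saveAsTable("sales_parquet")

// Later rerun for a single corrupted slice: with dynamic partition overwrite,
// only the DAY/COUNTRY partitions present in fixedDf are replaced.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
fixedDf.write.mode("overwrite").insertInto("sales_parquet")
// Note: insertInto matches columns by position, so fixedDf must have the same
// column order as the table (partition columns last).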

Select text files from s3 bucket to read in scala

I have text files in an s3 bucket with filenames like this
file 1 -> bucket/directory/date=2020-05-01/abc2020-05-01T05.37xyzds.txt
file 2 -> bucket/directory/date=2020-05-01/def2020-05-01T06.37pqrst.txt
file 3 -> bucket/directory/date=2020-05-01/ghi2020-05-01T07.37lmnop.txt
I need to read only the files that were written to this directory in the current hour. For instance, assuming today's date is 2020-05-01 and the time is 07:40 UTC, I need to read just file 3 and skip the rest.
I want to read these selected files into an RDD, where my processing starts. Right now I am loading all the files into an RDD and filtering it based on a timestamp column, but this is very time-consuming. My current read statement looks like this:
val rdd = sc.wholeTextFiles("s3a://bucket/directory/date=2020-05-01/")
Any ideas welcome! Thanks
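One option (a sketch, not taken from the thread) is to list the prefix with the Hadoop FileSystem API, keep only the objects written within the last hour, and pass just those paths to wholeTextFiles; you could equally parse the timestamp embedded in the file names instead of using the modification time.
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// List the S3 prefix and keep only files modified within the last hour
val dir = "s3a://bucket/directory/date=2020-05-01/"
val fs = FileSystem.get(new URI(dir), sc.hadoopConfiguration)
val oneHourAgo = System.currentTimeMillis() - 60 * 60 * 1000L

val recentPaths = fs.listStatus(new Path(dir))
  .filter(s => s.isFile && s.getModificationTime >= oneHourAgo)
  .map(_.getPath.toString)

// wholeTextFiles accepts a comma-separated list of paths
val rdd = sc.wholeTextFiles(recentPaths.mkString(","))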

Data loss while reading a huge file in Spark Scala

val data = spark.read
.text(filepath)
.toDF("val")
.withColumn("id", monotonically_increasing_id())
val count = data.count()
This code works fine when reading a file with up to 50k+ rows, but when a file has more rows than that, the code starts losing data. When it reads a file with 1 million+ rows, the final dataframe count shows only 65k+ rows.
I can't understand where the problem is in this code, and what needs to change so that it ingests every row into the final dataframe.
P.S. - The largest file this code will have to ingest has almost 14 million+ rows; currently it ingests only 2 million of them.
Seems related to How do I add a persistent column of row ids to Spark DataFrame?
i.e. avoid using monotonically_increasing_id and follow some of the suggestions from that thread.
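For instance, a sketch of that suggestion (an assumed adaptation, not code from the linked thread) replaces monotonically_increasing_id with an id derived from zipWithIndex, which is contiguous and deterministic:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

// Assign a contiguous 0..n-1 id per row via zipWithIndex instead of monotonically_increasing_id
val raw = spark.read.text(filepath).toDF("val")
val withId = spark.createDataFrame(
  raw.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) },
  raw.schema.add(StructField("id", LongType, nullable = false))
)
val count = withId.count()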

How to view specific changes in data at particular version in Delta Lake

Right now I have one test dataset which has 1 partition, and inside that partition there are 2 parquet files.
If I read data as:
val df = spark.read.format("delta").load("./test1510/table#v1")
Then I get latest data with 10,000 rows and if I read:
val df = spark.read.format("delta").load("./test1510/table#v0")
Then I get 612 rows. Now my question is: how can I view only the new rows that were added in version 1, which is 10,000 - 612 = 9,388 rows?
In short, at each version I just want to view which data changed. In the delta log I can see the JSON files, and inside those JSON files I can see that a separate parquet file is created at each version, but how can I view this in code?
I am using Spark with Scala
You don't even need to go down to the parquet file level. You can simply use a SQL query to achieve this.
%sql
SELECT * FROM test_delta VERSION AS OF 2 minus SELECT * FROM test_delta VERSION AS OF 1
The query above gives you the rows newly added in version 2 that were not in version 1.
In your case, to get the rows added in version 1, subtract the older snapshot from the newer one:
val df1 = spark.read.format("delta").load("./test1510/table#v1")
val df2 = spark.read.format("delta").load("./test1510/table#v0")
display(df1.except(df2))
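Equivalently, if the #vN path suffix isn't available in your environment, Delta's versionAsOf reader option gives the same time travel (a small sketch; same table path as above):
// Time travel via the versionAsOf option instead of the #vN path suffix
val v1 = spark.read.format("delta").option("versionAsOf", 1).load("./test1510/table")
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("./test1510/table")
display(v1.except(v0))   // rows added in version 1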

how to load specific row and column from an excel sheet through pyspark to HIVE table?

I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts from row number 4 and continues for thousands of rows.
Note: As per the requirement I am not supposed to delete the blank rows.
My goals are below
1) read the excel file in spark 2.1
2) ignore the first 3 rows, and read the data from 4th row to row number 50. The file has more than 2000 rows.
3) convert all the worksheets from the excel to separate CSV, and load them to existing HIVE tables.
Note: I have the flexibility of writing separate code for each worksheet.
How can I achieve this?
I can create a Df to read a single file and load it to HIVE. But I guess my requirement would need more than that.
You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).
There you have the following options:
1) use Hive directly to read the Excel files and CTAS into a table in CSV format
You would need to deploy the HadoopOffice Excel Serde
https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde
Then you need to create the table (see the documentation for all the options; the example reads from Sheet1 and skips the first 3 lines):
create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS
  INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
  OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
);
Then do CTAS into a CSV format table:
create table CSVTable ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS Select * from ExcelTable;
2) use Spark
Depending on the Spark version you have different options:
for Spark 1.x you can use the HadoopOffice FileFormat, and for Spark 2.x the Spark2 DataSource (the latter also includes support for Python). See the howtos here.
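For the Spark 2.x route, a rough sketch of what the data source read might look like in Scala (the format name and option keys below are assumptions based on the Hive TBLPROPERTIES above; please verify them against the HadoopOffice howtos before use):
// Assumption: option keys mirror the TBLPROPERTIES above with the "hadoopoffice." prefix dropped
val excelDf = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "US")
  .option("read.sheets", "Sheet1")
  .option("read.sheet.skiplines.num", "3")
  .option("read.sheet.skiplines.allsheets", "true")
  .load("/user/office/files")

// Rows 4 to 50 correspond to the first 47 data rows once the blank lines are skipped
excelDf.limit(47).write.option("header", "false").csv("/user/office/csv_out")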