Select text files from an S3 bucket to read in Scala - scala

I have text files in an S3 bucket with filenames like this:
file 1 -> bucket/directory/date=2020-05-01/abc2020-05-01T05.37xyzds.txt
file 2 -> bucket/directory/date=2020-05-01/def2020-05-01T06.37pqrst.txt
file 3 -> bucket/directory/date=2020-05-01/ghi2020-05-01T07.37lmnop.txt
I need to read only the files written to this directory during the current hour. For instance, if today's date is 2020-05-01 and the time is 07:40 UTC, I need to read just file 3 and skip the rest.
I want to read the selected files into an RDD, where my processing starts. Right now I load all the files into an RDD and filter on the timestamp column, but this is very time-consuming. My current read statement looks like this:
val rdd = sc.wholeTextFiles("s3a://bucket/directory/date=2020-05-01/")
Any ideas welcome! Thanks
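One possible approach, sketched here in Python, is to build a glob that matches only the current hour and pass it to the read call instead of the whole directory. It assumes the timestamp embedded in each filename (for example 2020-05-01T07.37) is in UTC and reflects when the file was written; the helper name hourly_glob is hypothetical. The resulting string is a plain Hadoop-style glob, so the same pattern can be passed to sc.wholeTextFiles from Scala.

```python
from datetime import datetime, timezone


def hourly_glob(base, now=None):
    """Build a glob matching only files stamped with the given UTC hour.

    Assumption: filenames embed a UTC timestamp as YYYY-MM-DDTHH.MM.
    """
    now = now or datetime.now(timezone.utc)
    day = now.strftime("%Y-%m-%d")
    hour = now.strftime("%H")
    # e.g. s3a://bucket/directory/date=2020-05-01/*2020-05-01T07.*
    return f"{base}/date={day}/*{day}T{hour}.*"


# Usage with an existing SparkContext `sc` (not created here):
# rdd = sc.wholeTextFiles(hourly_glob("s3a://bucket/directory"))
```

With this pattern only the matching files are listed and read, rather than loading everything and filtering afterwards.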

Related

pyspark read delta csv file by date

I have several CSV files in a folder.
The files with '20221205' are delta files and are newly uploaded into the folder today.
I want to read these 2 delta csv files only, and do some transformation and then append to existing table.
Every day I will upload 2 files with the current date as a suffix, then run the notebook to process only the files uploaded that day.
Question: how do I read only today's files with PySpark?
How should I load the delta
What you call delta is actually a normal CSV file with a different suffix, not to be confused with the Delta data format.
You can select the files using glob patterns: put the date into the path string and Spark will read only the files whose names end with that date:
spark.read.csv("path/to/folder/*20221205.csv")
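To avoid hard-coding the date, the pattern can be built from the current date. A minimal sketch, assuming the suffix format is YYYYMMDD as in the example above; the helper names are hypothetical and `spark` is an existing SparkSession passed in by the caller.

```python
from datetime import date


def todays_csv_glob(folder, day=None):
    """Return a glob for CSV files whose names end with the given date."""
    day = day or date.today()
    return f"{folder}/*{day.strftime('%Y%m%d')}.csv"


def read_todays_csv(spark, folder, day=None):
    """Read only the files matching the date suffix (spark: SparkSession)."""
    return spark.read.csv(todays_csv_glob(folder, day))
```

For example, todays_csv_glob("path/to/folder", date(2022, 12, 5)) yields the same pattern as the literal path shown above.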
I recommend, however, if possible, storing the CSVs partitioned in your file system, meaning each date is in a separate folder.
The file system will look something like:
folder
  date=2022-01-01
  date=2022-01-02
  ....
Then you can simply:
spark.read.csv('folder').filter(col('date') == '2022-01-02')
The filter on the date is cheap because the data is partitioned: behind the scenes Spark knows that rows with date = X are stored ONLY in the date=X folder, so it can skip the other folders instead of scanning them.
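An alternative to filtering is to point the reader at a single partition folder. The sketch below assumes the date=YYYY-MM-DD layout described above; the helper names are hypothetical. Spark's basePath read option tells partition discovery where the table root is, so the date column is still derived from the folder name.

```python
def partition_path(folder, day):
    """Return the folder holding one date partition, e.g. folder/date=2022-01-02."""
    return f"{folder}/date={day}"


def read_partition(spark, folder, day):
    """Read a single date partition (spark: an existing SparkSession)."""
    # basePath keeps `date` as a column even though only one
    # partition directory is passed to the reader.
    return spark.read.option("basePath", folder).csv(partition_path(folder, day))
```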

Data Factory - Can I use the date field in a CSV to determine the destination folder in Copy Activity

I have some CSV files that I want to copy to a specific folder in ADLS based on the date column within the file.
For example, the CSV file has a column named "date" that reads "2022-02-23" on all rows. I want to copy that file to a folder that has the corresponding year and month, such as "/curated/UK/ProjectABC/2022/02".
I've got a Lookup activity that points to the source CSV file and populates a Set Variable activity with the month using this dynamic content: @substring(string(activity('Lookup1').output.firstrow.date),5,2)
Would this be the right approach, to use a variable?
I can't use variables in the Directory portion of the Sink Dataset, as far as I know.
Have you come across this situation before?
Sounds like you're on the right path. You can absolutely use Dataset parameters:
Then populate them in your pipeline using a variable (or parameter, or expression):
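As a rough check of what such an expression should produce, the Python sketch below mirrors the substring arithmetic from the question (zero-based start index 5, length 2 for the month). The function name and the default root are illustrative only; in the pipeline itself the value would come from the Data Factory expression, not from Python.

```python
def sink_directory(date_value, root="/curated/UK/ProjectABC"):
    """Derive the year/month folder from a YYYY-MM-DD date string."""
    year = date_value[0:4]   # start 0, length 4 -> "2022"
    month = date_value[5:7]  # start 5, length 2 -> "02"
    return f"{root}/{year}/{month}"
```

For the date in the question, sink_directory("2022-02-23") gives "/curated/UK/ProjectABC/2022/02".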

Write Spark Dataset to Excel File along with partitioning

I have a Dataset similar to the below structure:
col_A col_B date
1 5 2021-04-14
2 7 2021-04-14
3 5 2021-04-14
4 9 2021-04-14
I am trying to use the below code in Spark Java to write the dataset to a file in HDFS.
Dataset<Row> outputDataset; // This is a valid dataset and works flawlessly when written to csv
/*
    some code which sets the outputDataset
*/
outputDataset
    .repartition(1)
    .write()
    .partitionBy("date")
    .format("com.crealytics.spark.excel")
    .option("header", "true")
    .save("/saveLoc/sales");
Normal Working Case:
When I use .format("csv"), the above code creates a folder named date=2021-04-14 under the path /saveLoc/sales passed to .save(), exactly as expected. The full path of the resulting file is /saveLoc/sales/date=2021-04-14/someFileName.csv. Also, the date column is removed from the file since it was partitioned on.
What I need to do:
However, when I use .format("com.crealytics.spark.excel"), it just creates a plain file called sales in the folder saveLoc and doesn't remove the partition column (date) from the resulting file. Does that mean it isn't partitioning on the column "date"? The full path of the file created is /saveLoc/sales. Please note that it overwrites the folder "sales" with a file named sales.
The Excel plugin used is described here: https://github.com/crealytics/spark-excel
How can I make it partition when writing Excel files? In other words, how can I make it behave exactly as it did in the CSV case?
Versions used:
spark-excel: com.crealytics.spark-excel_2.11
scala: org.apache.spark.spark-core_2.11
Thanks.
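One possible workaround, if the Excel data source ignores partitionBy as described, is to reproduce the layout by hand: write one file per distinct value of the partition column. The sketch below is in Python (PySpark) rather than Java and is untested against the plugin; the helper names and the file name part.xlsx are hypothetical, and collecting the distinct values on the driver assumes there are only a few partitions.

```python
def partition_file(base_path, column, value, name="part.xlsx"):
    """Mimic the CSV layout: base_path/column=value/name."""
    return f"{base_path}/{column}={value}/{name}"


def write_excel_partitions(df, base_path, column="date"):
    """Write one Excel file per distinct value of `column` (df: Spark DataFrame)."""
    values = [row[0] for row in df.select(column).distinct().collect()]
    for value in values:
        (df.filter(df[column] == value)
           .drop(column)  # the CSV writer also drops the partition column
           .write.format("com.crealytics.spark.excel")
           .option("header", "true")
           .mode("overwrite")
           .save(partition_file(base_path, column, value)))
```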

How to automatically transfer data

I have thousands of CSV files, and they come in two formats. Files in the first format have 100 rows and 2 columns; files in the second format have 50 columns and 5 rows. The numbers are given just as an example.
I want to write MATLAB code that extracts the complete second row of each CSV file in the first format and makes it the first row of a CSV file in the second format. There are equal numbers of files in the two formats.
Any help is appreciated.
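The core step can be sketched as follows. Note that this is Python rather than the MATLAB asked for, and it makes two assumptions the question leaves open: the extracted row is inserted above the existing rows (not replacing the first one), and the caller decides how files of the two formats are paired. All function names are hypothetical.

```python
import csv


def move_second_row(source_rows, target_rows):
    """Return target_rows with the second row of source_rows placed first."""
    return [source_rows[1]] + target_rows


def read_rows(path):
    """Load a CSV file as a list of rows."""
    with open(path, newline="") as f:
        return list(csv.reader(f))


def transfer(source_path, target_path):
    """Rewrite target_path so it starts with the second row of source_path."""
    rows = move_second_row(read_rows(source_path), read_rows(target_path))
    with open(target_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
```

Calling transfer once per (first-format, second-format) pair would process the whole collection.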

how to load specific row and column from an excel sheet through pyspark to HIVE table?

I have an excel file having 4 worksheets. Each worksheet has first 3 rows as blank, i.e. the data starts from row number 4 and that continues for thousands of rows further.
Note: As per the requirement I am not supposed to delete the blank rows.
My goals are:
1) read the Excel file in Spark 2.1
2) ignore the first 3 rows and read the data from row 4 to row 50. The file has more than 2000 rows.
3) convert each worksheet of the Excel file to a separate CSV and load them into existing Hive tables.
Note: I have the flexibility of writing separate code for each worksheet.
How can I achieve this?
I can create a DataFrame to read a single file and load it to Hive. But I guess my requirement needs more than that.
You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).
There you have the following options:
1) Use Hive directly to read the Excel files and CTAS into a table in CSV format
You would need to deploy the HadoopOffice Excel Serde:
https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde
Then you need to create the table (see the documentation for all the options; the example reads from Sheet1 and skips the first 3 lines):
create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS
  INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
  OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
);
Then do CTAS into a CSV format table:
create table CSVTable ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS Select * from ExcelTable;
2) use Spark
Depending on the Spark version you have different options:
For Spark 1.x you can use the HadoopOffice file format, and for Spark 2.x the Spark2 DataSource (the latter also includes support for Python). See howtos here
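Whichever reader is used, goal 2 of the question (rows 4 to 50) comes down to an offset calculation that is easy to get wrong by one. The sketch below is plain Python, independent of Spark and HadoopOffice, and only illustrates the arithmetic on an already-loaded sequence of rows; the function name is hypothetical.

```python
from itertools import islice


def rows_4_to_50(rows):
    """Keep 1-based rows 4..50 of any iterable of rows."""
    # 1-based rows 4..50 are 0-based indices 3..49; islice's stop is exclusive.
    return list(islice(rows, 3, 50))
```

The result holds 47 rows: the first 3 are skipped and everything after row 50 is ignored.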