PySpark: read delta CSV files by date

I have several CSV files in a folder.
The files with '20221205' in the name are delta files that were uploaded to the folder today.
I want to read only these 2 delta CSV files, apply some transformations, and then append the result to an existing table.
Every day I will upload 2 files with the current date as suffix, then run the notebook to process only the files uploaded that day.
Question: how can I read only today's files with PySpark?
How should I load the delta

What you call delta is actually a normal CSV file with a different date suffix, not to be confused with the Delta data format.
You can select files by name using glob patterns: put the date into the path string and Spark will read only the files whose names end with that date:
spark.read.csv("path/to/folder/*20221205.csv")
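To avoid hard-coding the date every day, the path can be built from the current date. A minimal sketch, assuming the files always end in `YYYYMMDD.csv` and that a SparkSession already exists; the helper names `dated_csv_glob` and `read_dated_csvs` are made up for illustration:

```python
from datetime import date
from typing import Optional


def dated_csv_glob(folder: str, day: Optional[date] = None) -> str:
    """Build a glob that matches only CSV files whose name ends in YYYYMMDD."""
    day = day or date.today()
    return f"{folder.rstrip('/')}/*{day:%Y%m%d}.csv"


def read_dated_csvs(spark, folder: str, day: Optional[date] = None):
    """Read the matching files using an existing SparkSession passed in as `spark`."""
    # header=True is an assumption about the files; drop it if they have no header row.
    return spark.read.csv(dated_csv_glob(folder, day), header=True)
```

In a notebook this would be called as `read_dated_csvs(spark, "path/to/folder")`; passing an explicit `day` makes it easy to re-run a past load.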
However, I recommend, if possible, storing the CSVs partitioned in your file system, meaning each date gets its own folder.
The layout would be something like:
folder/
    date=2022-01-01/
    date=2022-01-02/
    ...
Then you can simply run:
from pyspark.sql.functions import col
spark.read.csv('folder').filter(col('date') == '2022-01-02')
The filter on date is cheap because the data is partitioned: behind the scenes Spark knows that rows with date = X are stored only in the date=X folder, so it prunes every other folder without reading its files.
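For reference, a rough sketch of how such a layout could be written and then read back one day at a time. It assumes the DataFrame being appended already has a `date` column; the function names are hypothetical, and `header=True` is an assumption about the files:

```python
from datetime import date


def partition_dir(folder: str, day: date) -> str:
    """Path of the sub-folder that holds a single day, e.g. folder/date=2022-01-02."""
    return f"{folder.rstrip('/')}/date={day:%Y-%m-%d}"


def append_day(df, folder: str) -> None:
    """Append a DataFrame; Spark creates one date=... sub-folder per distinct date."""
    df.write.partitionBy("date").mode("append").csv(folder, header=True)


def read_day(spark, folder: str, day: date):
    """Read one partition directly; basePath keeps `date` available as a column."""
    return (
        spark.read.option("basePath", folder)
        .csv(partition_dir(folder, day), header=True)
    )
```

Reading the partition folder directly and filtering on `date` after reading the whole root should touch the same files; the direct path just makes the intent explicit.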

Related

Data Factory - Can I use the date field in a CSV to determine the destination folder in Copy Activity

I have some CSV files that I want to copy to a specific folder in ADLS based on the date column within the file.
For example, a CSV file has a column named "date" that reads "2022-02-23" on all rows. I want to copy that file to a folder that has the corresponding year and month, such as "/curated/UK/ProjectABC/2022/02".
I've got a Lookup activity that's pointing to the source CSV file and populating a Set Variable activity with the month using this dynamic content: @substring(string(activity('Lookup1').output.firstrow.date),5,2)
Would this be the right approach, to use a variable?
I can't use variables in the Directory portion of the Sink Dataset, as far as I know.
Have you come across this situation before?
Sounds like you're on the right path. You can absolutely use Dataset parameters.
Then populate them in your pipeline using a variable (or parameter, or expression).

Upload multiple files to pentaho

In Pentaho Data Integration, how do I import a list of xlsx files that are in the same folder?
Note: the number of columns is always the same.
If your Excel column names and sheet name are always the same, then you can use this solution. Here I take every xlsx file from the source folder and convert the files to CSV one by one.
But if your Excel column names and sheet name are dynamic, or you need a more dynamic solution, then you can use another Stack Overflow solution of mine from here.

Select text files from s3 bucket to read in scala

I have text files in an s3 bucket with filenames like this
file 1 -> bucket/directory/date=2020-05-01/abc2020-05-01T05.37xyzds.txt
file 2 -> bucket/directory/date=2020-05-01/def2020-05-01T06.37pqrst.txt
file 3 -> bucket/directory/date=2020-05-01/ghi2020-05-01T07.37lmnop.txt
I need to read only the files that were written to this directory during the current hour. For instance, assuming today's date is 2020-05-01 and the time is 7:40 UTC, I need to read just file 3 and skip the rest.
I want to read these selected files into an RDD where my processing starts. Right now I am loading all the files into an RDD and filtering them on the timestamp column, but this is very time-consuming. My current read statement looks like this:
val rdd = sc.wholeTextFiles("s3a://bucket/directory/date=2020-05-01/")
Any ideas welcome! Thanks
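One idea, sketched here in PySpark rather than Scala: since the hour is part of the file name, a glob can restrict the read to the current hour before anything is loaded. This assumes every file name embeds the timestamp as `YYYY-MM-DDTHH.MM` like the three examples above; `hourly_glob`, `read_current_hour` and `would_match` are hypothetical helpers, and `fnmatch` is only a local approximation of Hadoop's glob matching:

```python
from datetime import datetime, timezone
from fnmatch import fnmatch
from typing import Optional


def hourly_glob(base: str, now: Optional[datetime] = None) -> str:
    """Glob for files in today's date= folder whose name contains the current UTC hour."""
    now = now or datetime.now(timezone.utc)
    return f"{base.rstrip('/')}/date={now:%Y-%m-%d}/*{now:%Y-%m-%dT%H}.*"


def read_current_hour(sc, base: str, now: Optional[datetime] = None):
    """Load only the matching files with an existing SparkContext passed in as `sc`."""
    return sc.wholeTextFiles(hourly_glob(base, now))


def would_match(path: str, pattern: str) -> bool:
    """Local sanity check of a pattern; not the matcher Spark itself uses."""
    return fnmatch(path, pattern)
```

The same pattern string can be passed to `sc.wholeTextFiles` in Scala, so only the string formatting would need porting.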

Talend Open Studio DI: replace content of one column of .xlsx file with another column of .csv file

I have two input files:
an .xlsx file that looks like this:
a .csv file that looks like this:
I already have a Talend job that transforms the .xlsx file into an .xml file.
One node in the .xml file contains the stock location code:
<stockLocationCode>SL213</stockLocationCode>
The output .xml file looks like this:
Now I need to replace every occurrence of the stockLocationCode with the second column of the .csv file. In this case the result would be:
My Talend job looks like this:
I use a tMap component to put the columns of the .xlsx file into the right node of the output .xml file.
But I do not know how I can replace the stockLocationCode with the actual full stock location using the .csv file. I also tried to map the .csv file with the tMap component.
I would need to build in a method that looks at the current value of the node <stockLocationCode>, loops over the whole .csv file until it finds that value in the first column, and then replaces the <stockLocationCode> content with the content of the second column of the .csv file.
Performance is not important ;)
First, you'll need a lookup in e.g. a tMap or tXMLMap component, where you map your keys and add a new column holding the second column of the CSV file.
The resulting columns would look like this:
Product; Stock Location Code; CSV 2nd column data
Now in a second map you can just remove the stock location code and do the rest of your job.
Voilà, you have exchanged the columns.
You can also use tXMLMap with a lookup.
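Outside of Talend, the lookup itself boils down to a key/value join. A small Python sketch of that logic, not a Talend job: it assumes the .csv has the code in the first column and the full location in the second, and that `;` is the delimiter (an assumption, adjust as needed). The function names are made up for illustration:

```python
import csv
import io
import xml.etree.ElementTree as ET


def load_lookup(csv_text: str, delimiter: str = ";") -> dict:
    """Map first CSV column (code) to second CSV column (full location)."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return {row[0].strip(): row[1].strip() for row in reader if len(row) >= 2}


def replace_codes(xml_text: str, lookup: dict) -> str:
    """Replace the text of every <stockLocationCode> node that has a lookup match."""
    root = ET.fromstring(xml_text)
    for node in root.iter("stockLocationCode"):
        code = (node.text or "").strip()
        # Codes without a match are left unchanged.
        node.text = lookup.get(code, code)
    return ET.tostring(root, encoding="unicode")
```

Building the dictionary once replaces the "loop over the whole .csv for every node" idea from the question, which is also what the lookup input of tMap does conceptually.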

Perl - Scanning CSV files for rows that match user-specified criteria?

I am trying to write/learn a simple Perl parser for some CSV files that I have and I need some help.
In a directory I have a series of date-indexed CSV files in the form of Text-Date.csv. The date is in the form of Month-DD-YYYY (ex., January-07-2011). For each weekday there is a CSV file generated.
The Perl script should open each file, look for a particular row that matches user-entered criteria, and return that row. Each row is stock price data, with different stocks in different rows. What the script should do is return the price of a particular stock (e.g., IBM) across all dates for which CSVs are generated.
I have the parser working for a specific CSV/date that I choose, but I want to be able to pluck out the row from all CSVs. Also, when I print the IBM price for each dated CSV, I want to display the date next to the price (e.g., January-07-2011 IBM 147.93).
Can you help me get this done?
If your question is how to crawl a bunch of files and run some function on each one, you probably want File::Find. To parse CSV, definitely use Text::xSV and not a custom parser. There is more to parsing CSV than calling split(",").
To parse CSV files, use the Text::CSV module.
It is more complex to decide how you are going to apply the criteria - you'll need to determine what the user specifies and work out how to translate that into Perl code that evaluates the condition correctly.
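The overall flow (walk the files, parse each with a real CSV parser, keep the matching rows, carry the date from the file name) is the same in any language. Here it is sketched in Python for illustration only, not as the Perl solution. It assumes the stock symbol sits in the first column and that file names look like `Text-January-07-2011.csv`; `find_rows` is a hypothetical name:

```python
import csv
from pathlib import Path


def find_rows(directory: str, symbol: str, column: int = 0):
    """Yield (date_label, row) for every row whose `column` equals `symbol`."""
    # sorted() orders by file name, which is alphabetical, not chronological.
    for path in sorted(Path(directory).glob("*.csv")):
        # "Text-January-07-2011" -> "January-07-2011"
        date_label = path.stem.split("-", 1)[-1]
        with path.open(newline="") as handle:
            for row in csv.reader(handle):
                if len(row) > column and row[column].strip() == symbol:
                    yield date_label, row
```

Printing `date_label` followed by the joined row would give lines in the shape asked for, date first and then the stock data. In Perl the same structure would map onto a directory walk plus one of the CSV modules mentioned above.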