Retention scripts to container data - pyspark

I'm trying to apply data retention policies to data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
Basically, a new set of files is added every day by the copy task in Azure Data Factory. In reality there are more than 3 files per day.
I want to start applying different retention policies to different files. For example, the emails.parquet files, I want to delete the entire file after it is 30 days old. The customer files, I want to anonymise by replacing the contents of certain columns with some placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iteratively look through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I'd plan to use pyspark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for the date comparisons that decide whether data should no longer be retained, I can either use the dates in the folder structure (but I don't know how to do this), or there's a "RowStartDate" in every parquet file that I can use too.
Can anybody point me in the right direction to achieve this, either by the route I'm alluding to above (a pyspark script that iterates through container folders, adds data to a data frame, transforms it, then overwrites the original file) or by any other means?
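One way to make the folder-based date comparison concrete is sketched below in plain Python. It assumes the list of file paths has already been obtained from the container (for example with a platform utility such as `dbutils.fs.ls` on Databricks or `mssparkutils.fs.ls` on Synapse). The `POLICIES` table, the function names, and the 365-day limit for customer files are hypothetical and only there for illustration; only the 30-day rule for emails comes from the question.

```python
from datetime import date, timedelta
from pathlib import PurePosixPath

# Hypothetical policy table: file stem -> (action, max age in days).
# Only the 30-day rule for emails is from the question; 365 is a placeholder.
POLICIES = {
    "emails": ("delete", 30),
    "customer": ("anonymise", 365),
}


def folder_date(path):
    """Parse the yyyy/mm/dd folders of a path like '2022/06/30/emails.parquet'."""
    parts = PurePosixPath(path).parts
    year, month, day = (int(p) for p in parts[-4:-1])
    return date(year, month, day)


def plan_retention(paths, today):
    """Return (path, action) pairs for files strictly older than their limit."""
    plan = []
    for path in paths:
        policy = POLICIES.get(PurePosixPath(path).stem)
        if policy is None:
            continue  # no retention rule for this file type
        action, max_age_days = policy
        if today - folder_date(path) > timedelta(days=max_age_days):
            plan.append((path, action))
    return plan
```

The resulting plan can then drive the actual work: deleting a file with the platform's file-system utility, or reading a customer file into a data frame, overwriting the sensitive columns with a literal placeholder, and writing it back. Because Spark reads lazily, writing the anonymised data to a temporary path and swapping it in afterwards is likely safer than overwriting the path that is being read.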

Related

Reading multiple folders in parallel

I have multiple part folders, each containing parquet files (example given below). The schema can differ across part folders (either the number of columns or the datatype of certain columns). My requirement is to read all the part folders and finally create a single df according to a predefined schema that is passed in.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure what kind of changes are present in which part folders, I am reading each part folder individually, comparing its schema with the predefined schema, and making the necessary changes, i.e., adding/dropping columns or typecasting column datatypes. Once done, I write the result into a temp location, then move on to the next part folder and repeat the same operation. Once all the part folders are read, I read the temp location in one go to get the final output.
Now I want to do this operation in parallel, i.e., there would be parallel threads/processes (?) that read the part folders in parallel, execute the logic of schema comparison and any necessary changes, and write into a temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in the majority of scenarios the directories share the same schema, so a wildcard in the input path is enough to read them and create the df; that is not going to work in my case. The problem statement in the question linked below is similar to mine, but in my case the number of part folders to be read is random and sometimes over 1000. Moreover, there are operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first transforms the existing data into the appropriate schema, and the second reads the transformed data the convenient way (with * wildcards). Use Airflow (or Oozie) to start one data transformer application per directory. After all instances of the data transformer have finished successfully, run the union app.
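The first phase boils down to deciding, per part folder, which columns to add, drop, or cast. A minimal sketch of that decision in plain Python follows. It uses a thread pool inside a single process rather than the Airflow approach described above, and `TARGET_SCHEMA`, `alignment_plan`, and `plan_all` are hypothetical names. It assumes each folder's schema is available as a name-to-type mapping (in PySpark this could be built from `df.dtypes`).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical target schema: column name -> type name.
TARGET_SCHEMA = {"id": "bigint", "name": "string", "amount": "double"}


def alignment_plan(actual, target):
    """Compare one folder's schema (name -> type) with the target schema."""
    return {
        "add": sorted(c for c in target if c not in actual),
        "drop": sorted(c for c in actual if c not in target),
        "cast": sorted(c for c in target if c in actual and actual[c] != target[c]),
    }


def plan_all(folder_schemas, target, max_workers=8):
    """Compute one plan per part folder concurrently; returns folder -> plan."""
    folders = list(folder_schemas)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        plans = pool.map(lambda f: alignment_plan(folder_schemas[f], target), folders)
        return dict(zip(folders, plans))
```

In a real job the function submitted to the pool would also read the folder, apply the plan, and write to the temp location. Threads on the driver can submit Spark jobs concurrently, so this pattern may be enough without a separate orchestrator, although with over 1000 folders the worker count would need tuning.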

Apache Druid Appending Segment without dropping or summing it

I have three JSON files with the same timestamp but different values to upload to Druid. I want to upload them separately with the same segment granularity. However, each upload drops the existing segment and replaces it with the new one.
I don't want to use appendToExisting: True because it sums the values of identical rows, which I don't want to happen (I may be adding the same file again in the future).
Is there a way to add new data to a specific segment without dropping or summing it?

Bulk load causing schema drift in files (adf pipeline or mapping dataflows)

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like this: yyyy/mm/dd (e.g. 2021/08/19). At the dd level I have 24 files (one for each hour).
The columns in each day file are pretty wide, anything up to 250 columns. It has increased and decreased over time, hence there being schema drift when trying to load into SQL using mapping dataflows that made the file larger.
Around 200 of those columns I require and I know what they are. I even have them in a schema template. The rest are legacy or unwanted
I'd like to retain the original files in blob as they are, but load files with those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping dataflows to pick up all files in blob, so I don't have to iterate by creating multiple clusters or a foreach.
I'm not even sure how to handle this, or whether it should be a copy activity or a mapping data flow.
Both have their benefits, but I think a mapping data flow only makes sense if I need to transform parts of these files in depth.
Should I be combining the months or even years into a single file and then reading from that file, so I can exclude the additional columns from the ones I want to take into SQL Server?
Ideally this is a bulk load that needs some refinement when it lands.
Thanks in advance
Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based rules in the data flow Select transformation to choose the columns that you wish to pick from each file schema.

Dataprep "create dataset with parameters" does not take all files

I am struggling to get my files into the form Dataprep expects when importing a parameterized dataset from Google Cloud Storage (GCS).
To be specific:
I store .csv files on GCS in a location:
/20190807/file1.csv
/20190807/file2.csv
/20190807/file3.csv
/20190808/file1.csv
/20190808/file2.csv
/20190808/file3.csv
/20190809/file1.csv
/20190809/file2.csv
/20190809/file3.csv
...
Then I create a Dataset with Parameters on this location using a *wildcard.
In addition:
Encoding: I apply detect automatic structure, and I select UTF-8 (as I store all my files with this encoding).
Columns: I also make sure that all files have the same columns.
Problem:
However, for some reason, I guess depending on how the file was saved, Dataprep does not take all the files when importing them. When I compare a file that is imported with one that is not, I cannot identify what is different about them. Both have been saved with type application/octet-stream and I apply UTF-8 encoding.
As a consequence, when I export my output after the dataset is wrangled in a flow, some dates are missing (e.g. 20190808).
Is there a tool I can use to compare these two files and see what is different about them, in order to prevent this from happening? It is not an option to store them in different locations, as I do not know in advance which files will be different.
I am really surprised by this shortcoming, and it would be great to have a way to check only the columns of each file instead of also checking other "hidden" differences.
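As a starting point for such a comparison, a small script can surface byte-level traits that are invisible in most editors. The sketch below is plain Python with hypothetical function names. It checks for a UTF-8 byte order mark, UTF-8 validity, line-ending style, a trailing newline, and the header row. Whether any of these is what makes Dataprep skip a file is an assumption, not something the question confirms.

```python
import codecs


def describe_csv_bytes(raw):
    """Summarise byte-level traits of a CSV file that are easy to miss by eye."""
    has_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if has_bom else raw
    try:
        body.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    crlf = body.count(b"\r\n")
    lf_only = body.count(b"\n") - crlf
    header = body.split(b"\n", 1)[0].rstrip(b"\r")
    return {
        "has_bom": has_bom,
        "valid_utf8": valid_utf8,
        "crlf_lines": crlf,
        "lf_only_lines": lf_only,
        "ends_with_newline": body.endswith(b"\n"),
        "header": header.decode("utf-8", errors="replace"),
    }


def differing_traits(raw_a, raw_b):
    """Return only the traits on which the two files disagree."""
    a, b = describe_csv_bytes(raw_a), describe_csv_bytes(raw_b)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}
```

Reading one imported file and one skipped file in binary mode and passing both to `differing_traits` gives a short list of candidate causes to test one at a time.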

How to add record numbers to TextIO file sources in Apache Beam or Dataflow

I am using Dataflow (and now Beam) to process legacy text files to replicate the transformations of an existing ETL tool. The current process adds a record number (the record number for each row within each file) and the filename. The reason they want to keep this additional info is so that they can tell which file and record offset the source data came from.
I want to get to a point where I have a PCollection that contains the file record number and the filename as additional fields in the value or as part of the key.
I've seen a different article where the filename can be populated into the resulting PCollection, but I do not have a solution for adding the record number to each row. Currently the only way I can do it is to pre-process the files before I start the Dataflow process (which is a shame, since I would want Dataflow/Beam to do it all).
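One possible shape for this, sketched in plain Python, is to treat each file as a single unit of work and number its lines while reading it sequentially. The function name is hypothetical and the sketch opens a local file; inside a pipeline, a generator like this could be applied to a PCollection of file paths with `beam.FlatMap`, with the `open` call replaced by Beam's own file-system layer.

```python
import os


def numbered_records(path):
    """Yield (filename, record_number, line) for each line of one text file."""
    filename = os.path.basename(path)
    with open(path, encoding="utf-8") as handle:
        # Reading the whole file in one place keeps the original line order,
        # which is what makes a per-file record number meaningful.
        for record_number, line in enumerate(handle, start=1):
            yield filename, record_number, line.rstrip("\n")
```

The trade-off is that a single file is no longer split across workers, so parallelism comes only from processing many files at once.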