Bulk load causing schema drift in files (adf pipeline or mapping dataflows) - azure-data-factory

Bit of a challenge here
I have around 45,000 historic .parquet files
partitioned like yyyy/mm/dd (e.g. 2021/08/19); at the dd level I have 24 files (one for each hour).
The columns in each day's files are pretty wide, anything up to 250 columns. The column count has increased and decreased over time, hence the schema drift when trying to load into SQL using mapping dataflows.
I require around 200 of those columns and I know what they are; I even have them in a schema template. The rest are legacy or unwanted.
I'd like to retain the original files in blob as they are, but load only those 200 columns per file into SQL.
What is the best way to achieve this?
How do I iterate over every file but only take the columns I need?
I tried using a wildcard path
'2021/**/*.parquet'
within mapping dataflows to pick up all the files in blob, so I don't have to iterate with a foreach or spin up multiple clusters.
I'm not even sure how to handle this, or whether it should be a copy activity or a mapping dataflow;
both have their benefits, but I think I can only use a mapping dataflow if I need to transform parts of these files in depth.
Should I be combining the months or even years into a single file, then reading from that file, so I can exclude the additional columns from the ones I want to take into SQL Server?
Ideally this is a bulk load that needs some refinement when it lands.
Thanks in advance.

Add a data flow to the pipeline and use a Select transformation to choose the columns you wish to propagate. You can create pattern-based (rule-based) mappings in the Select transformation to pick the required columns from each file's schema.
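If a Spark notebook (Synapse or Databricks) is also an option for the bulk load, the same column pruning can be sketched in PySpark. Everything below (storage paths, column names) is a placeholder, not your actual setup:

# Minimal PySpark sketch: read every hourly file through one wildcard, keep only
# the columns from the schema template, and write a trimmed copy for the SQL load.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

wanted_columns = ["col_a", "col_b", "col_c"]   # the ~200 template columns

df = (spark.read
      .option("mergeSchema", "true")           # tolerate the drift across files
      .parquet("abfss://container@account.dfs.core.windows.net/2021/*/*/*.parquet"))

# Backfill any template column a given file never had, then fix the column order
# so the SQL load always sees one stable schema.
for c in wanted_columns:
    if c not in df.columns:
        df = df.withColumn(c, lit(None).cast("string"))

(df.select(*wanted_columns)
   .write.mode("overwrite")
   .parquet("abfss://container@account.dfs.core.windows.net/curated/2021/"))

The original files in blob stay as they are; only the curated copy with the 200 known columns gets loaded into SQL.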

Related

Reading multiple folders in parallel

I have multiple part folders, each containing parquet files (example given below). Across part folders the schema can differ (either the number of columns or the datatype of certain columns). My requirement is to read all the part folders and finally create a single df conforming to a predefined schema that is passed in.
/feed=abc -> contains multiple part folders based on date like below
/feed=abc/date=20221220
/feed=abc/date=20221221
.....
/feed=abc/date=20221231
Since I am not sure which changes are in which part folders, I am reading each part folder individually, comparing its schema with the predefined schema, and making the necessary changes, i.e. adding/dropping columns or casting column datatypes. Once done, I write the result to a temp location and then move on to the next part folder and repeat the same operation. Once all the part folders have been read, I read the temp location in one go to get the final output.
Now I want to do this operation in parallel, i.e. have parallel threads/processes (?) that read the part folders in parallel, execute the schema comparison and any necessary changes, and write to the temp location. Is this possible?
I searched here for parallel processing of multiple directories, but in the majority of those scenarios the schema is the same across directories, so they can use a wildcard on the input path to create the df; that will not work in my case. The problem statement in the link below is similar to mine, but in my case the number of part folders to read is variable and sometimes over 1,000. Moreover there are operations involved in comparing and fixing the column types as well.
Any help will be appreciated.
Reading multiple directories into multiple spark dataframes
Divide your existing ETL into two phases. The first transforms the existing data into the appropriate schema; the second reads the transformed data in a convenient way (with * wildcards). Use Airflow (or Oozie) to start one data-transformer application per directory, and after all instances of the data transformer have finished successfully, run the union app.
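Not the asker's exact code, but a minimal PySpark sketch of the per-directory transformer plus the union step, assuming the predefined schema is available as a StructType; the paths, dates and field names are placeholders. Spark actions launched from separate driver threads run concurrently, which gives the parallelism inside a single application rather than one application per directory:

# Conform each part folder to the predefined schema, in parallel, then union.
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

target_schema = StructType([            # placeholder for the real predefined schema
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("country", StringType()),
])

def conform(df, schema):
    # Add missing columns as nulls, drop extras, cast everything to the target types.
    for field in schema.fields:
        if field.name not in df.columns:
            df = df.withColumn(field.name, lit(None).cast(field.dataType))
    return df.select([col(f.name).cast(f.dataType) for f in schema.fields])

def transform_partition(date_str):
    src = f"/feed=abc/date={date_str}"
    out = f"/tmp/feed=abc_conformed/date={date_str}"
    conform(spark.read.parquet(src), target_schema).write.mode("overwrite").parquet(out)

dates = ["20221220", "20221221", "20221231"]     # in practice, listed from storage

with ThreadPoolExecutor(max_workers=8) as pool:  # per-folder jobs run concurrently
    list(pool.map(transform_partition, dates))

# The temp location now has a single schema, so the final wildcard read just works.
final_df = spark.read.parquet("/tmp/feed=abc_conformed/date=*")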

Retention scripts for container data

I'm trying to apply data retention policies to my data stored in container storage in my data lake. The content is structured like this:
2022/06/30/customer.parquet
2022/06/30/product.parquet
2022/06/30/emails.parquet
2022/07/01/customer.parquet
2022/07/01/product.parquet
2022/07/01/emails.parquet
Basically, new files are added every day by the copy task from Azure Data Factory; in reality there are more than 3 files per day.
I want to start applying different retention policies to different files. For example, I want to delete each emails.parquet file entirely once it is 30 days old. The customer files I want to anonymise by replacing the contents of certain columns with placeholder text.
I need to do this in a way that preserves the next stage of data processing - which is where pyspark scripts read all data for a given type (e.g. emails, or customer), transform it and output it to a different container.
So to apply the retention changes mentioned above, I think I need to iterate through the container, find each file (each emails file, or each customer file), do the transformations, and then output (overwrite) the original file. I plan to use PySpark notebooks for this, but I don't know how to iterate through folder structures in a container.
As for making date comparisons to decide whether data should be retained, I can either use the folder structure for the dates (but I don't know how to do this), or there's a "RowStartDate" in every parquet file that I can use instead.
Can anybody point me in the right direction, either via the route I'm alluding to above (a PySpark script to iterate through the container folders, load the data into a data frame, transform it, then overwrite the original file) or by any other means?
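Not a full answer, but a minimal PySpark sketch of one possible approach, listing the yyyy/MM/dd folders with the Hadoop FileSystem API; the container URL, the column names and the two retention rules are placeholders for whatever is actually needed:

# Walk the date folders, delete old emails.parquet data, and rewrite
# customer.parquet with anonymised columns. All names here are placeholders.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()
base = "abfss://container@account.dfs.core.windows.net"   # hypothetical container URL

jvm = spark.sparkContext._jvm
Path = jvm.org.apache.hadoop.fs.Path
fs = jvm.org.apache.hadoop.fs.FileSystem.get(
    jvm.java.net.URI(base), spark.sparkContext._jsc.hadoopConfiguration())

cutoff = datetime.utcnow() - timedelta(days=30)

# Each glob match is one yyyy/MM/dd folder; parse the date straight out of the path.
for status in fs.globStatus(Path(base + "/*/*/*")):
    folder = status.getPath().toString()
    y, m, d = folder.rstrip("/").split("/")[-3:]
    if datetime(int(y), int(m), int(d)) >= cutoff:
        continue

    # Rule 1: delete email data outright once it is more than 30 days old.
    if fs.exists(Path(folder + "/emails.parquet")):
        fs.delete(Path(folder + "/emails.parquet"), True)

    # Rule 2: anonymise certain customer columns. Write to a staging path first,
    # since Spark cannot safely overwrite a path it is still reading from.
    customer = folder + "/customer.parquet"
    staging = folder + "/customer_tmp.parquet"
    if fs.exists(Path(customer)):
        (spark.read.parquet(customer)
             .withColumn("email", lit("REDACTED"))     # placeholder columns
             .withColumn("name", lit("REDACTED"))
             .write.mode("overwrite").parquet(staging))
        fs.delete(Path(customer), True)
        fs.rename(Path(staging), Path(customer))

The "RowStartDate" column is an alternative to parsing the folder names: read each file, filter on that column, and write the filtered result back instead of deleting whole files.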

Dataprep "create dataset with parameters" does not take all files

I am struggling to organise my files the way Dataprep expects when importing a parameterized dataset from Google Cloud Storage (GCS).
To be specific:
I store .csv files on GCS in a location:
/20190807/file1.csv
/20190807/file2.csv
/20190807/file3.csv
/20190808/file1.csv
/20190808/file2.csv
/20190808/file3.csv
/20190809/file1.csv
/20190809/file2.csv
/20190809/file3.csv
...
Then I create a Dataset with Parameters on this location using a *wildcard.
In addition:
Encoding: I apply detect automatic structure, and I select UTF-8 (as I store all my files with this encoding).
Columns: I also make sure that all files have the same columns.
Problem:
However, for some reason or another, depending on how the files have been saved I guess, Dataprep does not take all the files when importing them. When I compare a file that is imported with one that is skipped, I cannot identify what is different about them. Both have been saved with the type application/octet-stream and I apply UTF-8 encoding.
As a consequence, when I export my output after the dataset is wrangled in a flow, I am missing some dates (e.g. 20190808).
Is there a tool I can use to compare these two files and see what is different about them, in order to prevent this from happening? It is not an option to store them in different locations, as I do not know in advance which files will be different.
I am really surprised by this shortcoming, and it would be great to have a way to check only the columns of each file instead of also checking other "hidden" differences.
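The usual invisible differences (byte-order mark, line endings, sniffed delimiter, header row) can be checked with a few lines of Python run locally against one file that Dataprep picks up and one that it skips; the file names below are placeholders:

# Compare two CSVs for "hidden" differences rather than content differences.
import csv

def inspect(path):
    with open(path, "rb") as f:
        raw = f.read()
    text = raw.decode("utf-8-sig", errors="replace")
    dialect = csv.Sniffer().sniff(text[:4096])
    return {
        "utf8_bom": raw.startswith(b"\xef\xbb\xbf"),
        "crlf_line_endings": b"\r\n" in raw,
        "delimiter": dialect.delimiter,
        "header": next(csv.reader(text.splitlines(), dialect)),
    }

a = inspect("20190807/file1.csv")   # a file Dataprep imports correctly
b = inspect("20190808/file1.csv")   # a file Dataprep skips
for key in a:
    if a[key] != b[key]:
        print(f"{key}: {a[key]!r} vs {b[key]!r}")

If the headers match but the BOM or line endings differ, normalising the files before upload (or in the export step that produces them) may be enough to make the parameterized dataset pick them all up.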

Spark: Is it possible to load an RDD from multiple files in different formats?

I have a heterogeneously formatted set of input files, processed in batch mode.
I want to run a batch over a number of files. These files are of different formats, and they will have different mappings to normalize data (e.g. extract fields with different schema names or positions in the records, to a standard naming).
Given the tabular nature of the data, I'm considering using Dataframes (cannot use datasets due to the Spark version I'm bound to).
In order to apply different extraction logic to each file, does each file need to be loaded into a separate dataframe, have its extraction logic applied (a process which is different for each file type, configured in terms of e.g. CSV/JSON/XML, the positions of fields to select for CSV, the names of fields to select for JSON, etc.), and then have the resulting datasets joined?
That would force me to iterate over the files, act on each dataframe separately, and join the dataframes afterwards, instead of applying the same (configurable) logic to all of them.
I know I could do it with RDDs, i.e. loading all files into an RDD, emitting a PairRDD[fileId, record], and then running a map where you look up the fileId to get the configuration for that record, which tells you which logic to apply.
I'd rather use Dataframes for all the niceties they offer over raw RDDs in terms of performance, simplicity and parsing.
Is there a better way to use Dataframes to address this problem than the one already explained? Any suggestions or misconceptions I may have?
I'm using Scala, though it should not matter to this problem.
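This is a sketch rather than a definitive pattern, written in PySpark for brevity even though the question uses Scala (the structure carries over directly); the paths, formats and field mappings are invented for illustration:

# Config-driven loading: each entry says how to read one file and how to map its
# fields onto the standard names, then everything is unioned into one DataFrame.
from functools import reduce
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

file_configs = [
    {"path": "/data/a.csv",  "format": "csv",  "options": {"header": "true"},
     "mapping": {"cust_id": "customer_id", "amt": "amount"}},
    {"path": "/data/b.json", "format": "json", "options": {},
     "mapping": {"customerId": "customer_id", "value": "amount"}},
]

def load_normalized(cfg):
    df = spark.read.format(cfg["format"]).options(**cfg["options"]).load(cfg["path"])
    # Rename/select the source fields onto the standard schema.
    return df.select([col(src).alias(dst) for src, dst in cfg["mapping"].items()])

frames = [load_normalized(cfg) for cfg in file_configs]
combined = reduce(DataFrame.unionByName, frames)

The per-file logic still differs, but it lives in configuration rather than in separate hand-written pipelines, and the result is a single DataFrame with the standard column names.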

Working with huge csv files in Tableau

I have a large csv file (1,000 rows x 70,000 columns) which I want to union with 2 smaller csv files (since these csv files will be updated in the future). In Tableau, working with such a large csv file results in very long processing times and sometimes causes Tableau to stop responding. I would like to know better ways of dealing with such large csv files, e.g. by splitting the data, converting the csv to another file type, connecting to a server, etc. Please let me know.
The first thing you should ensure is that you are accessing the file locally and not over a network. Sometimes the difference is minor, but in some cases a network share can cause major slowdowns in Tableau reading the file.
Beyond that, your file is pretty wide and should be normalized somewhat, so that you get more rows and fewer columns. Tableau will most likely read it in faster because it has fewer columns to analyze (data types, etc.).
If you don't know how to normalize the CSV file, you can use a tool like: http://www.convertcsv.com/pivot-csv.htm
Once you have the file normalized and connected in Tableau, you may want to extract it inside of Tableau for improved performance and file compression.
The problem isn't the size of the csv file: it is the structure. Almost anything trying to digest a csv will expect lots of rows but not many columns. Usually columns define the type of data (eg customer number, transaction value, transaction count, date...) and the rows define instances of the data (all the values for an individual transaction).
Tableau can happily cope with hundreds (maybe even thousands) of columns and millions of rows (I've happily ingested 25-million-row CSVs).
Very wide tables usually emerge because you have a "pivoted" analysis, with one set of data categories along the columns and another along the rows. For effective analysis you need to undo the pivoting (or derive the data from its source unpivoted). Cycle through the complete table (you can even do this in Excel VBA, despite the number of columns, by reading the CSV directly line by line rather than opening the file) and convert the first row (which is probably column headings) into a new column, so that each new row contains every combination of the original row label and column header, plus the relevant data value from the relevant cell in the CSV file. The new table will be 3 columns wide but will hold all the data from the CSV (assuming the CSV is structured the way I assumed). If I've misunderstood the structure of the file, you have a much bigger problem than I thought!
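If Excel VBA is not appealing, the unpivot described above can also be done in a few lines of Python/pandas; the file names and the id column are placeholders for whatever the wide CSV actually contains:

# Turn the 1,000 x 70,000 "pivoted" CSV into a tall 3-column table.
import pandas as pd

wide = pd.read_csv("wide_file.csv")

# Keep the row label, turn every other column header into a "category" value,
# and put the cell contents into a single "value" column.
tall = wide.melt(id_vars=["row_label"], var_name="category", value_name="value")

tall.to_csv("tall_file.csv", index=False)

The resulting file has many more rows but only three columns, which is the shape Tableau handles comfortably.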