ADF / Dataflow - Convert Multiple CSV to Parquet - azure-data-factory

In ADLS Gen2, the TextFiles folder has 3 CSV files. The column names are different in each file.
We need to convert all 3 CSV files to 3 Parquet files and put them in the ParquetFiles folder.
I tried to use a Copy activity, but it fails because the column names contain spaces and Parquet doesn't allow that.
To remove the spaces, I used a data flow: Source -> Select (replace spaces with underscores in column names) -> Sink. This worked for a single file, but when I tried to do it for all 3 files, it merged them and generated a single file with incorrect data.
How can I solve this, mainly removing the spaces from the column names in all files? What other options are there?

Pipeline: ForEach activity (loop over the CSV files in the folder and pass the current iteration item to the data flow as a parameter) -> Data Flow activity with a source that points to that folder (parameterize the file name in the source path). This runs the data flow once per file, so each CSV produces its own Parquet file instead of all three being merged.
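For the renaming itself, one common pattern (a sketch, not part of the original answer, so verify it against your own data flow) is a rule-based mapping in the Select transformation that matches every column and renames it with a "Name as" expression such as:
replace($$, ' ', '_')
where $$ refers to the name of each matched column.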

I created 2 datasets: a CSV dataset with a wildcard path and a Parquet dataset. I used a Copy Data activity with the CSV dataset as the source and the Parquet dataset as the sink, and set the copy behavior to Merge files.

Related

aws_s3.query_export_to_s3 PostgreSQL RDS extension exporting all multi-part CSV files to S3 with a header

I'm using the aws_s3.query_export_to_s3 function to export data from an Amazon Aurora Postgresql database to S3 in CSV format with a header row.
This works.
However, when the export is large and outputs to multiple part files, the first part file has the CSV header row, and subsequent part files do not.
SELECT * FROM aws_s3.query_export_to_s3(
    'SELECT ...',
    aws_commons.create_s3_uri(...),
    options := 'format csv, HEADER true'
);
How can I make this export add the header row to all CSV file parts?
I'm using Apache Spark to load this CSV data and it expects a header row in each individual part file.
It's not possible, unfortunately.
The aws_s3.query_export_to_s3 function uses the PostgreSQL COPY command under the hood and then chunks the output into files depending on size.
Unless the extension picks up on the HEADER true option, caches the header, and then provides an option to apply it to every CSV file generated, you're out of luck.
The expectation is that the files are combined at the destination when downloaded, that the file processor has some mechanism for reading files in parts, or that it only needs the header once.
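As a workaround on the Spark side (a sketch only; the bucket path and column names below are hypothetical, not from the original post), you can avoid relying on per-file headers by supplying an explicit schema and dropping the single header row that the first part file contains:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema; replace with the columns of your export.
schema = StructType([
    StructField("id", StringType()),
    StructField("name", StringType()),
])

# Read every part file without header handling, then filter out the
# one header row that the first part file carries.
df = (spark.read
      .schema(schema)
      .option("header", "false")
      .csv("s3://my-bucket/export-prefix*"))
df = df.filter(df["id"] != "id")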

How to create a unique list of codes from multiple files in multiple subfolders in ADF?

We have a folder structure in data lake like this:
folder1/subfolderA/parquet files
folder1/subfolderB/parquet files
folder1/subfolderC/parquet files
folder1/subfolderD/parquet files
etc.
All the parquet files have the same schema, and all the parquet files have, amongst other fields, a code field, let's call it code_XX.
Now I want the distinct values of code_XX from all parquet files in all folders.
So if the code_XX value 'A345' appears multiple times in the parquet files in subfolderA and subfolderC, I only want it once.
Output must be a Parquet file with all unique codes.
Is this doable in Azure Data Factory, and how?
If not, can it be done in Databricks?
You can try the following.
Set the source folder path to recursively look for all parquet files, and choose a column to store the file names.
As it seems you only need the file names in the output parquet file, use a Select transformation to forward only that column.
Use an expression in a derived column to get the file names from the path string.
distinct(array(right(fileNames,locate('/',reverse(fileNames))-1)))
If you have access to a SQL database, this can be done with two Copy activities; no data flows are needed.
Copy Activity 1 (Parquet to SQL): Ingest all files into a staging table.
Copy Activity 2 (SQL to Parquet): Select DISTINCT code_XX from the staging table.
NOTE:
Use Mapping to only extract the column you need.
Use a wildcard file path with the recursive option enabled to copy all files from subfolders. https://learn.microsoft.com/en-us/azure/data-factory/connector-azure-blob-storage?tabs=data-factory#blob-storage-as-a-source-type
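For the Databricks option the question also asks about, a minimal PySpark sketch (the lake path is hypothetical; code_XX is the column name from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical ADLS path; recursiveFileLookup picks up the parquet
# files in every subfolder under folder1.
df = (spark.read
      .option("recursiveFileLookup", "true")
      .parquet("abfss://container@account.dfs.core.windows.net/folder1"))

# Keep only the code column, de-duplicate, and write a single parquet file.
(df.select("code_XX")
   .distinct()
   .coalesce(1)
   .write.mode("overwrite")
   .parquet("abfss://container@account.dfs.core.windows.net/folder1_unique_codes"))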

Merge different schemas from a blob container into a single SQL table

I have to read 10 files from a folder in a blob container with different schemas (most of the schemas among the files match) and merge them into a single SQL table.
file 1: let's say there are 25 such columns
file 2: some of the columns in file 2 match columns in file 1
file 3:
output: a SQL table
How do I set up a pipeline in Azure Data Factory to merge these columns into a single SQL table?
My approach:
Get Metadata activity -> ForEach over childItems -> Copy activity
For the mapping, I constructed a JSON that contains all the source/sink columns from these files.
You can create a JSON file that contains each source file name together with its Tabular Translator. Then use a Lookup activity to get this file's content (don't check "First row only"). Loop over this array in a ForEach activity and pass the source file name to your dataset. Finally, create a Copy Data activity and use the Tabular Translator as your mapping.
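A sketch of what that lookup file might contain (file names and column names are made up for illustration; check the TabularTranslator mapping shape against your own copy activity before relying on it):
[
    {
        "sourceFileName": "file1.csv",
        "translator": {
            "type": "TabularTranslator",
            "mappings": [
                { "source": { "name": "Col A" }, "sink": { "name": "ColA" } },
                { "source": { "name": "Col B" }, "sink": { "name": "ColB" } }
            ]
        }
    }
]
In the Copy activity, the mapping is then set with dynamic content pointing at the translator object of the current ForEach item.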

Google Cloud Data Fusion is appending a column to original data

When I am loading encrypted data from a GCS source to a GCS sink, one additional column is getting added.
Original data
Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
1,Vinay,Argekar,01/01/2017,India
2,Thirukkumaran,Haridass,02/02/2017,USA
3,David,Wu,03/04/2000,Canada
4,Vinod,Kumar,04/02/2002,India
5,Joshua,Abraham,04/15/2010,France
6,Allaudin,Dastigar,09/24/2012,UK
7,Senthil,Kumar,08/15/2009,Germany
8,Sudha,Narayanan,12/14/2016,India
9,Ravi,Prasad,11/11/2011,Costa Rica
Data in the file after running the pipeline
0,Employee ID,Employee First Name,Employee Last Name,Employee Joining Date,Employee location
91,1,Vinay,Argekar,01/01/2017,India
124,2,Thirukkumaran,Haridass,02/02/2017,US
164,3,David,Wu,03/04/2000,Canada
193,4,Vinod,Kumar,04/02/2002,India
224,5,Joshua,Abraham,04/15/2010,France
259,6,Allaudin,Dastigar,09/24/2012,UK
293,7,Senthil,Kumar,08/15/2009,Germany
328,8,Sudha,Narayanan,12/14/2016,India
363,9,Ravi,Prasad,11/11/2011,Costa Rica
The first column (0, 91, 124, ...) was not present in the original file.
When you configured the GCS source, did you specify the Format as CSV, or was it left as Text? When the Format is Text, the output schema actually contains an offset, which is the first column that you see in the output data. When you specify the format as CSV, you have to specify the output schema of the file.

How to write a CSV file into one file with PySpark

I use this method to write a CSV file, but it generates a folder containing multiple part files. That is not what I want; I need it in one file. I also found another post that uses Scala to force everything to be calculated on one partition and then get one file.
First question: how do I achieve this in Python?
The second post also says that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note: when you use the coalesce function you lose parallelism for the write.
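A slightly fuller sketch (the output path is just illustrative): coalesce(1) still writes a directory containing a single part file, and the header has to be requested explicitly:
(df.coalesce(1)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("output/result_dir"))  # a directory with one part-*.csv inside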
You can do this by using the cat command-line utility as below. This will concatenate all of the part files into 1 CSV, so there is no need to repartition down to 1 partition. Note that it assumes the output is on the local filesystem and that the parts were written without per-file headers (otherwise the header line would be repeated).
import os
test.write.csv('output/test')
os.system("cat output/test/p* > output/test.csv")
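An equivalent without shelling out (still assuming the part files live on the local filesystem):
import glob
import shutil

# Concatenate every part file into a single CSV, in sorted order.
with open("output/test.csv", "wb") as merged:
    for part in sorted(glob.glob("output/test/part-*")):
        with open(part, "rb") as f:
            shutil.copyfileobj(f, merged)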
The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor, meaning the RDD partitions spread across executors get shuffled to a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, we can add a column header to the resulting CSV file.
First we can keep a utility function to make the data CSV-compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and it needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header at the top of the CSV file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The RDD saveAsPickleFile method can be used to serialize the saved data in order to save space; use SparkContext's pickleFile to read the pickled file back.
I needed my CSV output in a single file, with headers, saved to an S3 bucket with the filename I provided. The currently accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one CSV file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected. Keep in mind that toPandas() collects the whole result to the driver, so this only suits data that fits in driver memory.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)