Is there any PySpark method to read multiple files with different headers?

I have to migrate multiple files (around 2000) in the same folder in Azure Blob Storage. I want to read each file with its own header (the header is different for every file),
and write it to a destination folder.
Is there any way I can do this in parallel via PySpark?
I am using the code below, but it only picks up the header from the first file, which produces wrong output.
df = spark.read.option("header", "true").parquet("directory/*.parquet")
df.write.option("header", "true").csv("directory")
Please help if you know how I can read all the files, each with its own source header.
Thanks!
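One possible approach, sketched here rather than taken from the original thread: because every file carries its own header, read and write each file individually instead of pointing Spark at the whole directory, and run the per-file jobs concurrently from a driver-side thread pool. The container paths and the .csv extension below are placeholder assumptions; if your sources really are Parquet, swap the reader accordingly (Parquet files carry their own schema, so the header option is not needed when reading them).

import os
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder locations - replace with your own blob storage paths.
src_dir = "wasbs://container@account.blob.core.windows.net/source"
dst_dir = "wasbs://container@account.blob.core.windows.net/destination"

# List the source files on the driver via the Hadoop FileSystem API.
HadoopPath = spark._jvm.org.apache.hadoop.fs.Path
fs = HadoopPath(src_dir).getFileSystem(spark._jsc.hadoopConfiguration())
files = [status.getPath().toString()
         for status in fs.listStatus(HadoopPath(src_dir))
         if status.getPath().getName().endswith(".csv")]

def copy_one(path):
    # Reading one file at a time means each file's own first row is used as its header.
    df = spark.read.option("header", "true").csv(path)
    name = os.path.splitext(os.path.basename(path))[0]
    # Spark writes each output as a folder of part files under this path.
    df.write.mode("overwrite").option("header", "true").csv(f"{dst_dir}/{name}")

# A single SparkSession can run several jobs at once, so a small thread pool
# on the driver keeps many of these per-file jobs in flight in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(copy_one, files))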

Azure Data Factory data flow file sink

I am using a .csv file to import data into an Azure SQL database. After the data import is complete, I move the source file from the Source container to the myArchive container. I am trying to save the file as SaleData_yyyyMMdd_HHmm.csv, but instead a folder with that name gets created and the file is broken down into multiple part files (part-00000-, part-00001-, ...). Could you please guide me on how to specify the filename with the current date & timestamp?
File System: myArchive
Folder Path: concat('SalesDepartment/Warehouse1/','SaleData_',toString(currentTimestamp(),'yyyyMMdd_HHmm'),'.csv')
The folder path can be specified directly in the sink dataset. (Note: my source and sink are both of delimited type.)
For the filename:
Under the sink dataset, create a parameter to pass the file name and use it in the file name portion of the dataset.
Use the expression below as the copy activity sink's parameter value:
@concat('SaleData_',formatDateTime(utcnow(),'yyyyMMdd_HHmm'),'.csv')
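For instance, if the pipeline ran at 09:30 UTC on 12 March 2024 (an illustrative timestamp, not taken from the original post), the expression above would resolve to SaleData_20240312_0930.csv.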
Remember, this just copies your source under a different name. You need to add a Delete activity to delete the original source file.
If you are using a data flow,
make sure you choose Single partition in the Optimize tab of the sink instead of Use current partitioning.
Then go to Settings and choose Output to single file. Under filename, enter an expression with the timestamp:
concat('SaleData_', toString(currentUTC(), 'yyyyMMdd_HHmm'), '.csv')

How to rename a file in ADF?

I am copying data from SQL to ADLS dynamically, and I want to rename the file after it has been copied into ADLS. How can I achieve this? Requesting your suggestions.
Thanks in Advance.
Regards,
Ashok
My first question would be "why bother renaming parquet files?" Hopefully you aren't generating a single parquet file, which would seem to defeat the purpose of using Parquet. Instead, my focus would be on the folder name.
OPTION 1
If I did care about the file names, I would use Data Flow and configure the Sink to use patterned naming.
You could then pass the desired file name in as a Data Flow parameter,
and set it dynamically using an expression.
[NOTE: I haven't tested this syntax, but I recommend you always use the Expression Builder to enter these expressions].
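As a rough illustration only (this exact expression does not appear in the original answer and is untested), the parameter value could follow the same concat/toString pattern used elsewhere on this page, for example:
concat('MyRenamedFile_', toString(currentTimestamp(), 'yyyyMMdd_HHmm'), '.parquet')
Here MyRenamedFile_ and the .parquet extension are placeholders to adjust to your own naming.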
OPTION 2
If none of that suits your purposes, then another option would be brute force: use a Copy activity with binary datasets to copy the file to a new file with the desired name, then a Delete activity to remove the old one.

Google Data Fusion: reading files from multiple subfolders in a bucket and placing them in another folder inside each subfolder

Example
sameer/student/land/compressed files
sameer/student/pro/uncompressed files
sameer/employee/land/compressed files
sameer/employee/pro/uncompressed files
In the above example, I need to read the files from all LAND folders present in the different subdirectories, process them, and place them in the PRO folders within the same subfolders.
For this I have taken two GCS nodes, one for the source and another for the sink.
In the GCS source I have provided the path gs://sameer/; it reads files from all subfolders and merges them into one file, placing it in the sink path.
Expected output: all files should be placed in the subdirectories they were fetched from.
I can achieve the expected output by running the pipeline separately for each folder,
but I am hoping this can be done in a single pipeline run.
It seems like your use case is simply moving files. In that case, I would suggest using the Action plugin GCS Move or GCS Copy.
It seems like the task you are trying to carry out is not possible to do in one single Data Fusion pipeline, at least at the time of writing this.
In a pipeline, all the sources and sinks have to be connected. Otherwise you will get the following error:
'Invalid DAG. There is an island made up of stages ...'
This means it is not possible to parallelise several uncompression tasks, one for each folder of files, inside the same pipeline.
At the same time, if you were to use something like the following schema, the outputs would be aggregated and replicated over all of the sinks:
Finally, I would say that the only case in which you can parallelise a task between several sources and several sinks is when using multiple database tables. By means of the plug-ins (2) and (3) you can process data from multiple table inputs and export the output to multiple tables. If you would like to see all available plugins for Data Fusion, please check the following link (4).

Iterate each folder in Azure Data Factory

In our Data Lake storage, we receive an unspecified number of folders every day. Each of these folders contains at least one file.
Example of folders:
FolderA
|_/2020
|_/03
|_/12
|_fileA.json
|_/04
|_/13
|_fileB.json
FolderB
|_/2020
|_/03
|_/12
|_fileC.json
Folder C/...
Folder D/...
So on..
Now:
1. How do I iterate over every folder and get the file(s) inside it?
2. I would also like to 'Copy Data' from each of these files and produce a single .csv file from them. What would be the best approach to achieve this?
This can be done with a single copy activity using wildcard filtering in the source dataset, as seen here: https://azure.microsoft.com/en-us/updates/data-factory-supports-wildcard-file-filter-for-copy-activity/
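For example (an illustrative pattern only, not from the original answer), with the folder layout above you could point the source dataset at the container root and use a wildcard path along the lines of */*/*/*/*.json, one * per folder level, so that fileA.json, fileB.json, fileC.json and so on are all picked up; check the linked announcement and the ADF documentation for the exact wildcard semantics.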
Then, in the sink tab of the copy activity, select Merge Files as the copy behavior.
If you have extra requirements, another way to do this is by using Mapping Dataflows. Mark Kromer explains a similar scenario here: https://kromerbigdata.com/2019/07/05/adf-mapping-data-flows-iterate-multiple-files-with-source-transformation/
Hope this helped!

Reading file from Google Drive with Talend

I need to read an uploaded file in Google Drive and perform an X transformation on it. As per my reading, the only way to do it is by downloading the file to my local machine with the Talend component and then reading it from there.
If that is correct, I cannot figure out what the file name would be, given that I don't want to use the exact name of the file.
I found http://meowbi.com/2018/02/23/getting-google-sheet-gdrive-talend/ and it is exactly what I need - read from Google Drive, check the file name, and proceed if the file name is X. What is unclear to me is what they used in tJava.
The output schema of the tGoogleDriveList component's Main row contains a field name, which is the file name you're looking for. Using an Iterate row is less straightforward, as you need to extract values from the GlobalMap. In the article you cited, they get the file name via the "tGoogleDriveList_1_TITLE" key of the GlobalMap.
Main row between tGoogleDriveList and tJava
For more details, please look into the Talend Reference for the Google Drive components. The "Listing files and folders in Google Drive" section should be particularly relevant for your case.