ADF Data Flow creates empty file - azure-data-flow

I have set up a very basic data transformation using a "Data flow". I'm taking in a CSV file, modifying one of the columns, and writing to a new CSV file in an "output" directory. I noticed that after the pipeline runs, not only does it create the output folder but it also creates an empty file with the same name as the output folder.
Did I set something up wrong, or is this empty file normal?
Sink settings (screenshots of the Sink, Settings, Mapping, and Optimize tabs and the Storage view)
Thank you,
Scott

Related

Load ForEach SQL into single CSV in Azure Data Factory

I have an ADF pipeline where I am executing a stored procedure in a ForEach and using a Copy Data activity to load the output into a CSV file.
On each iteration of the ForEach, the CSV is being cleared down and loaded with that iteration's data.
I need it to preserve the already loaded data and append the output from each iteration.
The CSV should end up with the full dataset from all iterations.
How can I achieve this? I tried using the "Merge Files" option in the sink Copy Behavior, but it doesn't work for SQL to CSV.
As @All About BI mentioned, the append behavior you are looking for is not currently supported.
You can raise a feature request from the ADF portal.
Alternatively, you can follow the process below to append data to a CSV.
In my repro, I am generating the loop items with a Set Variable activity and passing them to a ForEach activity.
Inside the ForEach activity, a Copy Data activity executes the stored procedure in the source and copies the stored procedure's output to a CSV file.
In the Copy Data activity sink, generate the file name from the current item of the ForEach loop so that each iteration writes its data to a different file. I also add a constant prefix to the file name so that these files can be identified and deleted at the end, after merging.
File name: @concat('1_sp_sql_data_', string(item()), '.csv')
Add another Copy Data activity after the ForEach activity to combine the data from all the files produced by the ForEach iterations into a single file. Here I am using a wildcard path (*) to get all files from the folder.
In the sink, set the destination file name and set the copy behavior to Merge files so that all source data is copied to a single sink file.
After the merge, the data is in a single file, but the per-iteration files are not deleted, so the next time you run the pipeline the old files may get merged in again along with the new ones.
• To avoid this, add a Delete activity to delete the files generated by the ForEach activity.
• Because I added a constant prefix when generating these files, it is easy to delete them by file name (delete all files that start with "1_").
Destination file: (screenshot of the merged output file)
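Putting the steps together, here is a sketch of the expressions and settings involved (the names are just the ones used in this repro; adjust them to your own datasets and folders):
• ForEach Copy Data sink file name: @concat('1_sp_sql_data_', string(item()), '.csv')
• Merge Copy Data source: wildcard path * over the folder the ForEach writes to, with the sink copy behavior set to Merge files
• Delete activity: a wildcard file name such as 1_* to remove the per-iteration files once they have been merged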
You could try this:
Load each iteration's data to a separate CSV file.
Later, union or merge them all into one file.
Right now we don't have the ability to append rows to a CSV.

How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?

I have done the data flow tutorial. The sink currently creates 4 files in Azure Data Lake Gen2.
I suppose this is related to the HDFS file system.
Is it possible to save the output without the _success, _committed, and _started files?
What is the best practice? Should they be removed after saving to Data Lake Gen2?
Are they needed in further data processing?
https://learn.microsoft.com/en-us/azure/data-factory/tutorial-data-flow
There are a couple of options available.
You can specify the output file name in the Sink transformation settings.
Select Output to single file from the File name option dropdown and give the output file name.
You could also parameterize the output file name as required. Refer to this SO thread.
Alternatively, you can add a Delete activity after the data flow activity in the pipeline and delete the extra files from the folder.
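As a rough sketch of the parameterized option (not from the original answer): if you pass the file name into the data flow as a string parameter, here called fileName, and use that parameter as the sink's output file name, the pipeline could set it with a dynamic content expression such as:
@concat('output_', formatDateTime(utcNow(), 'yyyyMMdd'), '.csv')
For the Delete activity route, a wildcard file name along the lines of _* should match the _success, _committed, and _started marker files while leaving the data files in place.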

Invalid data stored in Delimited Text after running ADF Data Flow

I'm trying to store data from an input to a CSV file in blob storage via an ADF data flow. The pipeline ran successfully; however, on checking the CSV file, I see some invalid data included. Here are the settings of the Delimited Text dataset and the Sink. Please let me know what I am missing.
I tested this and reproduced the error.
The error is caused by the fact that the CSV files in the csv/test folder have different schemas.
Even if the pipeline runs without error, the data written to the single output file will still be invalid.
In Data Factory, when we merge multiple files into one, or copy data from multiple files into a single file, the files in the folder must have the same schema.
Note: please use wildcard paths to filter the CSV files.
For example, I have two CSV files with the same schema in the container:
Source dataset preview:
Only if the source dataset preview is correct will the output file be correct.
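As a small sketch of the wildcard path mentioned above (folder names taken from this answer; adjust to your own container layout), the source would point at something like:
csv/test/*.csv
so that only CSV files, and not other file types in the folder, are picked up; the files matched by the wildcard still need to share the same schema.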

Can (Open Studio) Talend be used to automate a data load from a folder to vertica?

I have been looking for a way to automate my data loads into Vertica instead of manually exporting flat files each time, and stumbled upon the ETL tool Talend.
I have been working with a test folder containing multiple CSV files, and am attempting to find a way to build a job so the files can be loaded into Vertica.
However, I see that in the Open Studio (free) version, if your files do not have the same schema, this becomes next to impossible without the dynamic schema option, which is only in the Enterprise version.
I start with tFileList and attempt to iterate through tFileInputDelimited, but the schemas are not uniform, so of course it stops the processing.
So, long story short, am I correct in assuming that there is no way to automate data loads in the free version of Talend if you have a folder consisting of files with different schemas?
If anyone has any suggestions for other open-source ETL tools to look at, or another solution, that would be great.
You can access the CURRENT_FILE variable from a tFileList component and then send a file down a different route depending on the file name. You'd then create a tFileInputDelimited for each file. For example, if you had two files named file1.csv and file2.csv, right-click the tFileList and choose Trigger > Run If. In the Run If condition, type ((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file1.csv") and drag it to the tFileInputDelimited set up to handle file1.csv. Do the same for file2.csv, changing the filename in the Run If condition.
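For the second route, the Run If condition is the same expression with the other file name (the component name tFileList_1 is taken from the example above; yours may differ):
((String)globalMap.get("tFileList_1_CURRENT_FILE")).toLowerCase().matches("file2.csv")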

Spark saveAsTextFile to Azure Blob creates a blob instead of a text file

I am trying to save an RDD to a text file. My instance of Spark is running on Linux and connected to Azure Blob storage.
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
//find the rows which have only one digit in the 7th column in the CSV
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
rdd1.saveAsTextFile("wasb:///HVACOut")
When I look at the output, it is not a single text file but a series of application/octet-stream files in a folder called HVACOut.
How can I output it as a single text file instead?
Well, I am not sure you can get just one file without a directory. If you do
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")
you will get one file inside a directory called "HVACOut"; the file will be named something like part-00000. This is because your RDD is distributed across your cluster in what are called partitions. When you call save (any of the save functions), Spark writes one file per partition, so by calling coalesce(1) you are telling it you want a single partition.
Hope this helps.
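A minimal sketch putting the pieces from the question and this answer together (same paths as above; the exact part-file name can vary):
val rdd = sc.textFile("wasb:///HdiSamples/HdiSamples/SensorSampleData/hvac/HVAC.csv")
val rdd1 = rdd.filter(s => s.split(",")(6).length() == 1)
// coalesce(1) collapses the RDD to a single partition, so saveAsTextFile
// writes one part file (e.g. HVACOut/part-00000) plus a _SUCCESS marker file
rdd1.coalesce(1).saveAsTextFile("wasb:///HVACOut")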
After you finish provisioning an Apache Spark cluster on Azure HDInsight, you can go to the built-in Jupyter notebooks for your cluster at https://YOURCLUSTERNAME.azurehdinsight.net/jupyter.
There you will find sample notebooks with examples of how to do this.
Specifically, for Scala, you can go to the notebook named "02 - Read and write data from Azure Storage Blobs (WASB) (Scala)".
Copying some of the code and comments here:
Note:
Because CSV is not natively supported by Spark, there is no built-in way to write an RDD to a CSV file. However, you can work around this if you want to save your data as CSV.
Code:
csvFile.map((line) => line.mkString(",")).saveAsTextFile("wasb:///example/data/HVAC2sc.csv")
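As a hedged sketch (not from the notebook) of how the same workaround could be applied to the HVAC data from the question, assuming rdd1 still holds raw comma-separated lines and with an illustrative output path:
val fields = rdd1.map(line => line.split(","))       // RDD[Array[String]], one array per row
val cleaned = fields.map(cols => cols.map(_.trim))   // example per-field transformation
cleaned.map(cols => cols.mkString(",")).saveAsTextFile("wasb:///example/data/HVACcleaned.csv")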
Hope this helps!