ADF Using Union to combine data - azure-data-factory

I'm trying to combine data from 2 data sources in ADF. The data is being combined, but the rows do not end up in the order I want. I want to do this using union.
Below is my dataflow containing the union.
Source1 contains 1 row of data, whereas source2 contains multiple rows. When these rows are combined using union, they end up in a random order. However, I want the single row from source1 to be the first row in the sink output. Does anyone know how to do this? I've tried attaching the union to source1 instead, but that doesn't work either.
source2 data example:
1110,555,666,1
1130,345,876,5
source1 data example:
uniquekey,number,id,position
Current Output:
1110,555,666,1
1130,345,876,5
uniquekey,number,id,position
Desired Output:
uniquekey,number,id,position
1110,555,666,1
1130,345,876,5

I tried to repro your issue and initially got the expected output, but when I ran the pipeline it generated two separate files at the sink, mirroring the two sources.
So I performed the steps below to get the required output.
Source1 file.
Source2 file.
Union Configuration as follows:
Union settings tab
Optimize tab
Inspect tab
Sink Configuration:
Do not provide a filename in the Sink configuration.
In the Sink configuration, use Single partition on the Optimize tab.
Keep the mapping in the Sink configuration as shown below:
Expected Output:

Related

How to add header to file in Azure Data Factory

I am storing the header in a CSV file and concatenating it with the data file using a mapping data flow.
I am using a union activity to combine these two files. While combining the header file and the data file, I can see the data, but the header row is not at the top. It appears at a random position in the sink file.
How can I make the header appear at the top?
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Premium
Journey,CompanyReferenceIDType,CompanyReferenceID,Currency,LedgerType,AccountingDate,JournalSource
1,Company_Reference_ID,PH2_TIC,USD,PH2_Actuals,2021-03-12,PH2_X_VB_V3,85738,V3Commission
Update:
My debug result is as follows; I think it is what you want:
I created a simple test to merge two csv files, header.csv and values.csv.
As @Mark Kromer MSFT said, we can use a Surrogate Key and then sort the rows. The Row_No of header.csv will start from 1 and that of values.csv will start from 2.
Set the header source to header.csv and don't select First row as header.
Set the values source to values.csv and don't select First row as header.
At the SurrogateKey1 activity, enter Row_No as the Key column and 1 as the Start value.
At the SurrogateKey2 activity, enter Row_No as the Key column and 2 as the Start value.
Then we can union the SurrogateKey1 stream and the SurrogateKey2 stream at the Union1 activity.
Then we can sort the rows by Row_No at the Sort1 activity.
Finally, we can use a Select1 activity to filter out the Row_No column.
I think it is what you want:
For now, you would need to use a Surrogate Key for the different streams and make sure that the header row has 1 for the surrogate key value and sort by that column.
We are working on a feature for adding a header to the delimited text sink as a property in the data flow Sink. That will make it much easier and should light up in the UI soon.
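If it helps to see the same idea in code rather than in the data flow UI, here is a minimal PySpark sketch of the surrogate-key-then-sort pattern (this is an analogy only, not ADF; the file names and matching column counts are assumed for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical inputs: a one-row header file and a multi-row values file,
# both read without treating the first row as a header
header_df = spark.read.csv("header.csv", header=False)
values_df = spark.read.csv("values.csv", header=False)

# tag each stream with a sort key (header = 1, values = 2), union, sort, drop the key
ordered = (header_df.withColumn("Row_No", F.lit(1))
           .unionByName(values_df.withColumn("Row_No", F.lit(2)))
           .orderBy("Row_No")
           .drop("Row_No"))

ordered.coalesce(1).write.csv("output_dir")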

Column defined in source Dataset could not be found in the actual source

I have an ADF Copy Data flow and I'm getting the following error at runtime:
My source is defined as follows:
In my data set, the column is defined as shown below:
As you can see from the second image, the column IsLiftStation is defined in the source. Any idea why ADF cannot find the column?
I've had the same error. You can solve it either by selecting all columns (*) in the source and then mapping the ones you want to the sink schema, or by 'clearing' the mapping, in which case the ADF Copy activity will auto-map to columns in the sink schema (best if the columns have the same names in source and sink). Either of these approaches works.
Unfortunately, clicking the import schema button in the mapping tab doesn't work. It does produce the correct column mappings based on the columns in the source query but I still get the original error 'the column could not be located in the actual source' after doing this mapping.
Could you check whether there is a column named 'ae_type_id' in your schema? If there is, could you remove that column and try again? The columns in the schema must be aligned with the columns in the query.
The issue is caused by an incomplete schema in one of the data sources. My solution is:
Step through the data flow, selecting the first schema, and Import projection.
Go to the flow and Data Preview
Repeat for each step.
In my case, there were trailing commas in one of the CSV files. This caused automatic column names to be created on import, which allowed me to find and fix the data file.
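As a quick illustration of that symptom (shown here with pandas rather than ADF, purely for demonstration), a trailing comma in the header row shows up as an extra, automatically named column when the file is read:

import io
import pandas as pd

# a header row with a trailing comma produces an extra, auto-named column
csv_text = "id,name,value,\n1,foo,10\n2,bar,20\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())  # ['id', 'name', 'value', 'Unnamed: 3']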

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading it back from that file. However, having to store data in an external file and read it back is messy and adds unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The pivot solution has two parts: first, a tMap that transposes the value of each input row (in) into the corresponding column of the output row (out), and second, a tAggregate that groups the map's output rows by brandname.
For the tMap you'd have to fill the columns conditionally like this, for example for the output column named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregate you'd have to group by out.brandname and aggregate each column as sum ignoring nulls.
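For comparison only (this is PySpark, not Talend, and the Spark session and sample DataFrame setup are assumed for illustration), the same conditional-column-plus-aggregate idea looks like this:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

rows = [("A", "xyz", 2), ("B", "xyz", 2), ("A", "abc", 3),
        ("C", "def", 1), ("C", "ghi", 6), ("A", "ghi", 1)]
df = spark.createDataFrame(rows, ["brandname", "metric", "value"])

metrics = ["abc", "def", "ghi", "xyz"]  # fixed set of metric names, as assumed above

# one conditional column per metric (the tMap step), then group by brandname
# and sum while ignoring nulls (the tAggregate step)
pivoted = df.groupBy("brandname").agg(
    *[F.sum(F.when(F.col("metric") == m, F.col("value"))).alias(m) for m in metrics]
)
pivoted.orderBy("brandname").show()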

How to write csv file into one file by pyspark

I use this method to write a csv file, but it generates a folder containing multiple part files. That is not what I want; I need it in one file. I also found another post using Scala to force everything to be calculated on one partition and then get one file.
First question: how can I achieve this in Python?
In the second post, it is also said that a Hadoop function could merge multiple files into one.
Second question: is it possible to merge two files in Spark?
You can use:
df.coalesce(1).write.csv('result.csv')
Note:
when you use the coalesce function, you lose parallelism for that write.
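A slightly fuller variant of the same call (the path here is just an example) also writes the column headers; note that Spark still creates a directory containing a single part file, not a bare result.csv:

df.coalesce(1).write.option("header", "true").mode("overwrite").csv("output/result_csv")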
You can do this by using the cat command line function as below. This will concatenate all of the part files into 1 csv. There is no need to repartition down to 1 partition.
import os

# write the DataFrame as the usual directory of part files
test.write.csv('output/test')
# then concatenate all part files into a single csv
os.system("cat output/test/p* > output/test.csv")
The requirement is to save an RDD in a single CSV file by bringing the RDD to one executor, which means the RDD partitions present across executors would be shuffled to a single executor. We can use coalesce(1) or repartition(1) for this purpose. In addition, one can add a column header to the resulting csv file.
First, we can keep a utility function to make the data CSV compatible.
def toCSVLine(data):
    return ','.join(str(d) for d in data)
Let's suppose MyRDD has five columns and needs 'ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age' as column headers. So I create a header RDD and union it with MyRDD as below, which most of the time keeps the header on top of the csv file.
unionHeaderRDD = sc.parallelize([('ID', 'DT_KEY', 'Grade', 'Score', 'TRF_Age')]) \
    .union(MyRDD)
unionHeaderRDD.coalesce(1).map(toCSVLine).saveAsTextFile("MyFileLocation")
The saveAsPickleFile RDD API method can be used to serialize the saved data in order to save space. Use pickleFile to read the pickled file back.
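For example (the path is assumed for illustration):

# save the combined RDD in pickled (serialized) form, then read it back
unionHeaderRDD.coalesce(1).saveAsPickleFile("MyPickleLocation")
restored = sc.pickleFile("MyPickleLocation")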
I needed my csv output in a single file, with headers, saved to an s3 bucket with the filename I provided. The current accepted answer, when I run it (Spark 3.3.1 on a Databricks cluster), gives me a folder with the desired filename, and inside it there is one csv file (due to coalesce(1)) with a random name and no headers.
I found that sending it to pandas as an intermediate step provided just a single file with headers, exactly as expected.
my_spark_df.toPandas().to_csv('s3_csv_path.csv',index=False)

Output Sequence while writing to HDFS using Apache Spark

I am working on a project in Apache Spark, and the requirement is to write the processed output from Spark in a specific format: Header -> Data -> Trailer. For writing to HDFS I am using the .saveAsHadoopFile method and writing the data to multiple files, using the key as the file name. But the issue is that the sequence of the data is not maintained; files are written as Data -> Header -> Trailer or a different combination of the three. Is there anything I am missing with RDD transformations?
OK, so after reading StackOverflow questions, blogs, and mailing list archives found via Google, I found out how exactly .union() and other transformations work and how partitioning is managed. When we use .union(), the resulting RDD loses the partitioning information and also the ordering, and that's why my output sequence was not maintained.
What I did to overcome the issue was to number the records:
Header = 1, Body = 2, Footer = 3
Then, using sortBy on the RDD that is the union of all three, I sorted by this order number into 1 partition. After that, to write to multiple files using the key as the filename, I used a HashPartitioner so that all the data for a key goes into its own file.
val header: RDD[(String,(String,Int))] = ... // this is my header RDD
val data:   RDD[(String,(String,Int))] = ... // this is my data RDD
val footer: RDD[(String,(String,Int))] = ... // this is my footer RDD
// union all three, sort by the order number into 1 partition, then drop the order number
val finalRDD: RDD[(String,String)] = header.union(data).union(footer).sortBy(x => x._2._2, true, 1).map(x => (x._1, x._2._1))
// partition by key so that all records for a key land in the same output file
val output: RDD[(String,String)] = new PairRDDFunctions[String,String](finalRDD).partitionBy(new HashPartitioner(num))
output.saveAsHadoopFile ... // using MultipleTextOutputFormat, save to multiple files with the key as the filename
This might not be the final or most economical solution, but it worked. I am also trying to find other ways to maintain the output sequence as Header -> Body -> Footer. I also tried .coalesce(1) on all three RDDs before the union, but that just added three more transformations to the RDDs; the .sortBy function also takes the partition information, which I thought would be the same, but coalescing the RDDs first also worked. If anyone has another approach, please let me know or add to this; it would be really helpful as I am new to Spark.
References:
Write to multiple outputs by key Spark - one Spark job
Ordered union on spark RDDs
http://apache-spark-user-list.1001560.n3.nabble.com/Union-of-2-RDD-s-only-returns-the-first-one-td766.html -- this one helped a lot