Azure Data Factory merge 2 csv files with different schema - merge

I am trying to merge two CSV files (in Azure Data Factory) that have different schemas. Here is the scenario:
CSV 1: 15 columns -> say 5 dimensions and 10 metrics (x1, x2, ... x10)
CSV 2: 15 columns -> 5 dimensions (same as above) and 10 metrics (different from above: y1, y2, ... y10)
So the schemas differ. I need to merge both CSV files so that the 5 dimensions appear only once, together with all 20 metrics.
I tried a data transformation using the Select operation. That gives me 2 rows in the merged file: one row with the 5 dimensions and the first 10 metrics, and a second row with the same 5 dimensions and the other 10 metrics. That is incorrect, because I am looking for a single row with the 5 dimensions and all 20 metrics (x1, x2, ... x10, y1, y2, ... y10).
Any help on this issue is much appreciated.

Thank you @sac for the update, and thank you @Joel Cochran for the suggestion. Posting it as an answer to help other community members.
Use a Join transformation with the join type set to Inner. Use the key columns (the common dimension columns) from the 2 input files as the join condition. This outputs all columns from file1 and file2.
Then use a Select transformation to pick the required columns from the join output.
The process step by step:
(i) Join the 2 source files with an inner join, using the key columns in the join condition.
(ii) The output of the Join transformation lists all columns from source1 and all columns from source2 (including the duplicate key columns from both source files).
(iii) Use a Select transformation to remove the duplicate (or otherwise unwanted) columns from the join output.
(iv) The output of the Select transformation is the merged result.
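The join-then-select logic above can be sketched outside ADF in plain Python. This is only an illustration of what the two transformations compute, not ADF code: the sample data, the column names (d1, d2, x1, ...) and the helper inner_join are hypothetical, and the example uses 2 dimensions and 2 metrics per file instead of 5 and 10.

```python
import csv
import io

# Hypothetical sample data: two dimensions and two metrics per file.
CSV1 = "d1,d2,x1,x2\na,b,1,2\nc,d,3,4\n"
CSV2 = "d1,d2,y1,y2\na,b,10,20\nc,d,30,40\n"
KEYS = ["d1", "d2"]  # the shared dimension columns used as the join condition


def inner_join(left_text, right_text, keys):
    """Inner-join two CSV texts on the key columns, keeping each key column once."""
    left_rows = list(csv.DictReader(io.StringIO(left_text)))
    right_rows = list(csv.DictReader(io.StringIO(right_text)))

    # Index the right-hand rows by their key tuple.
    right_index = {}
    for row in right_rows:
        right_index.setdefault(tuple(row[k] for k in keys), []).append(row)

    merged = []
    for left in left_rows:
        for right in right_index.get(tuple(left[k] for k in keys), []):
            # The "select" step: drop the duplicate key columns from the right side.
            extra = {col: val for col, val in right.items() if col not in keys}
            merged.append({**left, **extra})
    return merged


rows = inner_join(CSV1, CSV2, KEYS)
for row in rows:
    print(row)
```

Each key combination yields one output row holding the dimensions once plus the metrics of both files. Note that an inner join drops any key combination that is present in only one of the files.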

Related

Combine two sources with different schemas in a single file keeping the schema per row in ADF Dataflow

I need to combine two sources into a single sink file while keeping each row's own schema. Example:
File 1:
Column 1 | Column 2 | Column 3 | Column 4
A        | B        | C        | D
File 2:
Column 1 | Column 2
J        | K
Output File:
A, B, C, D
J, K
No header row is needed.
Each column is separated by a comma.
Each row keeps its own structure/schema.
Thanks for the help.
@Kd85 As per your comment, you want to combine 2 CSV files and store the output in a .txt file.
If you use a Binary dataset as the sink in a copy activity, you can only copy from a Binary dataset.
Please refer to https://learn.microsoft.com/en-us/azure/data-factory/format-binary
Instead, you can simply use a Union transformation in your data flow.
Choose "Output to single file" in the sink settings.
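As a rough illustration of the result the question asks for (not of how ADF implements a union internally), the following Python sketch appends the rows of both sources into one output without a header and without padding the shorter rows. The in-memory file contents and the helper union_rows are hypothetical.

```python
import csv
import io

# Hypothetical in-memory stand-ins for the two source files (no header rows).
FILE_1 = "A,B,C,D\n"
FILE_2 = "J,K\n"


def union_rows(*texts):
    """Append the rows of every source in order, leaving each row's width unchanged."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for text in texts:
        for row in csv.reader(io.StringIO(text)):
            writer.writerow(row)
    return out.getvalue()


combined = union_rows(FILE_1, FILE_2)
print(combined, end="")
```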

Exclude Combination of Data Items From One Table From Another

I have a view, A, with 20 columns that forms my primary data. I have a table, B, which contains some of the columns from A and holds the data I want to exclude from A.
For example, table B has 6 columns, 2 of which are 'customer' and 'country', containing the values 'HP' and 'America'. These columns also exist in A. I want to write a query that returns the data from A except for any rows that have the combination HP and America.
There are 6 columns, and a row in table B can have any combination of them filled in. Anywhere between 1 and all 6 columns could be filled in; for example, one row could have 5 columns filled in and another row a different 5 columns, and so on.
I want to be prepared for any possible combination of the 6 columns: the query should search A for each combination in B and exclude any matching rows.
I have tried this:
SELECT *
FROM A T1
WHERE NOT EXISTS
(
    SELECT *
    FROM [dbo].[ExcludedItems] T2
    WHERE ReportNumber = 1
      AND
      (
            T1.job            = ISNULL(T2.job, T1.job)
        AND T1.CustomerName   = ISNULL(T2.CustomerName, T1.CustomerName)
        AND T1.COUNTRY        = ISNULL(T2.COUNTRY, T1.COUNTRY)
        AND T1.CONTINENT      = ISNULL(T2.CONTINENT, T1.CONTINENT)
        AND T1.continer       = ISNULL(T2.ContainerName, T1.continer)
        AND T1.UnscheduledJob = ISNULL(T2.unscheduledJob, T1.UnscheduledJob)
        AND T1.[Price]        = ISNULL(T2.Price, T1.Price)
        AND T1.[Haulage]      = ISNULL(T2.[Haulage], T1.[Haulage])
        AND T1.SiteAdress     = ISNULL(T2.SiteAddress, T1.SiteAdress)
        AND T1.Delta          = ISNULL(T2.Delta, T1.Delta)
        AND T1.Cost           = ISNULL(T2.Cost, T1.Cost)
      )
)
The problem is that the result set is not correct. With a smaller column sample I am able to exclude the correct combination of Customer and Country, but when I introduce a 3rd or 4th column into the combination I can eyeball the result set and immediately see that it is incorrect. I am not sure whether I have to use a separate NOT EXISTS for each possible combination; I was hoping not to.
A constraint is that A has to be a view, not a table. Otherwise I would have used variables in some manner and wrapped the whole thing in a stored procedure.
I appreciate any help. The fallback is to manually add to the code each time an item combination is supplied in B!
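One possible cause, offered only as a guess since the thread does not show the data: if a column in A can itself be NULL, then T1.col = ISNULL(T2.col, T1.col) compares NULL with NULL when the rule leaves that column empty, which is not true in SQL, so the rule silently fails to match that row. A pattern that treats an empty rule column as a wildcard without touching T1 is (T2.col IS NULL OR T1.col = T2.col). The sketch below tries this in SQLite through Python rather than SQL Server; the three-column tables and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (CustomerName TEXT, Country TEXT, Price REAL);
    CREATE TABLE ExcludedItems (ReportNumber INT, CustomerName TEXT, Country TEXT, Price REAL);

    INSERT INTO A VALUES ('HP', 'America', 10), ('HP', 'America', NULL),
                         ('HP', 'France', 10), ('Dell', 'America', 10);
    -- One exclusion rule: customer and country are set, Price is left empty (wildcard).
    INSERT INTO ExcludedItems VALUES (1, 'HP', 'America', NULL);
""")

QUERY = """
    SELECT T1.CustomerName, T1.Country
    FROM A T1
    WHERE NOT EXISTS (
        SELECT 1
        FROM ExcludedItems T2
        WHERE T2.ReportNumber = 1
          AND (T2.CustomerName IS NULL OR T1.CustomerName = T2.CustomerName)
          AND (T2.Country      IS NULL OR T1.Country      = T2.Country)
          AND (T2.Price        IS NULL OR T1.Price        = T2.Price)
    )
    ORDER BY T1.CustomerName, T1.Country
"""

remaining = conn.execute(QUERY).fetchall()
print(remaining)
```

Both HP/America rows are excluded here, including the one whose Price is NULL. A rule row with every column empty would match, and therefore exclude, every row of A, so such rows are worth guarding against.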

Merge Columns from various files in Talend

I am trying to merge columns from the files in a folder using Talend. (The files are local.)
Example: there are 4 files in a folder (there could also be 'n' files).
Each file has one column with 100 values.
After the merge, the output file would have 4 (or 'n') columns with 100 records.
Is it possible to merge this way using Talend components?
I tried with 2 files in a tMap, but the output records are multiplied (the records in the first file * the records in the second file).
Any help would be appreciated.
Thanks.
You have to determine how to join the data from the different files.
If row number N of each file has to be matched with row number N of the other files, then you must add a sequence to each file and join on the sequences to get your result. Be careful: you depend entirely on the order of the data in each file.
Then you can build this job:
tFileInputdelimited_1 --> tMap_1 --->{tMap_5
tFileInputdelimited_2 --> tMap_2 --->{tMap_5
tFileInputdelimited_3 --> tMap_3 --->{tMap_5
tFileInputdelimited_4 --> tMap_4 --->{tMap_5
In tMaps 1 to 4, copy the input to the output and add a "sequence" column (datatype integer) to the output, populated with Numeric.sequence("IDENTIFIER1",1,1). You then have 2 columns in the output: your data and a unique sequence.
Be careful to use a different identifier for each source.
Then, in tMap_5, just join on the different sequences and map each input column to the output.
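The idea behind the job (number the rows of each source, then join on that number) can be sketched in Python. This is not Talend code; the file contents and the helpers add_sequence and join_on_sequence are hypothetical, and the example uses 3 values per file instead of 100.

```python
from itertools import count

# Hypothetical contents of four single-column files.
SOURCES = [
    ["a1", "a2", "a3"],
    ["b1", "b2", "b3"],
    ["c1", "c2", "c3"],
    ["d1", "d2", "d3"],
]


def add_sequence(values):
    """Mimic the per-source tMap: pair every value with a 1-based sequence number."""
    return {seq: value for seq, value in zip(count(1), values)}


def join_on_sequence(sources):
    """Mimic the final tMap: inner-join all sources on their sequence numbers."""
    first, *others = [add_sequence(values) for values in sources]
    rows = []
    for seq in sorted(first):
        if all(seq in other for other in others):
            rows.append([first[seq]] + [other[seq] for other in others])
    return rows


merged = join_on_sequence(SOURCES)
for row in merged:
    print(",".join(row))
```

Because the join key is the row position, the output has one row per position and one column per file, instead of the cross product seen when the tMap has no join key.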

How to split 2 or more delimited columns in a single row to multiple rows using Talend

I am trying to move data from a CSV file to a DB table. There are 2 delimited columns in the CSV file (values separated by ";"). I would like to create a row for each of the delimited values at matching indexes, as shown below. The assumption is that both columns contain the same number of delimited items.
Example CSV Input:
Labels  Values
A;B;C   1;2;3
D       4
F;G     5;6
Expected Output:
Labels  Values
A       1
B       2
C       3
D       4
F       5
G       6
How can I achieve this? I have tried using tNormalize but this only works for a single column. Also I tried 2 successive tNormalize nodes but as expected it resulted in unwanted combinations.
Thanks
Read your CSV file with a tFileInputDelimited component and define the schema for the file.
Assuming you are using MySQL, also drop a tMysqlOutput component onto your designer to save the parsed file to the DB.
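The answer above stops before the splitting step. As a language-neutral illustration of the logic the question asks for (not a Talend job), the following Python sketch splits both columns on ";" and pairs the items at matching indexes. The helper explode is hypothetical.

```python
def explode(rows, sep=";"):
    """Split both delimited columns and emit one (label, value) row per matching index."""
    out = []
    for labels, values in rows:
        label_items = labels.split(sep)
        value_items = values.split(sep)
        # The question assumes both columns hold the same number of items.
        if len(label_items) != len(value_items):
            raise ValueError(f"item counts differ: {labels!r} vs {values!r}")
        out.extend(zip(label_items, value_items))
    return out


# Rows taken from the example input in the question.
csv_rows = [("A;B;C", "1;2;3"), ("D", "4"), ("F;G", "5;6")]
result = explode(csv_rows)
for label, value in result:
    print(label, value)
```

Pairing by index is what avoids the unwanted combinations produced by two successive normalize steps, which split each column independently.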

Need help building complex multi-table queries

This question is something that a lot of people who are learning bioinformatics and are new to DNA data analysis struggle with:
Let's say I have 20 tables with the same column headings. Each table represents a patient sample, and each row represents a locus (site) that has mutated in that sample. Each site is uniquely identified by two columns together: chromosome number and base number (e.g. 1 and 43535, 1 and 33456, 1 and 3454353). Several columns give different characteristics of each mutation, including a column called Gene, which gives the gene at that site. Multiple sites can be mutated in one gene, meaning the Gene column can hold the same value multiple times in one table.
I want to query all these tables at the same time by, let's say, Gene. I input a value from the Gene column, and I want as output the names of all the tables (samples) in which that gene name is present in the Gene column, and preferably also the entire line(s) for each sample, so that I can compare the characteristics of the mutation in that gene across multiple samples on one output page.
I also want to input a number, say 4, and get as output a list of genes that have mutated in at least 4 of the 20 patients (genes whose names appear in the Gene column in at least 4 of the 20 tables).
What is the "easiest way" to do this? What is the "best way", assuming I want to make more flexible queries besides these two?
I am an MD and do not have any particular software expertise, but I am willing to put in the necessary time to build this query system. A few lines of code won't put me off.
Eg data:
Func    Gene    ExonicFunc         Chr  Start     End       Ref  Obs
exonic  ACTRT2  nonsynonymous SNV  1    2939346   2939346   G    A
exonic  EIF4G3  nonsynonymous SNV  1    21226201  21226201  G    A
exonic  CSMD2   nonsynonymous SNV  1    34123714  34123714  C    T
This is just a third of the columns. Multiple columns were removed to fit the page size here...
Thank you.
Create a view that unions all the tables together. You should probably add a column that records which table each row comes from:
create view allpatients as
select 'a' as whichtable, t.*
from tableA t
union all
select 'b' as whichtable, t.*
from tableB t
...
You might find that it is easier to "instantiate" the view by creating a table with all patients. Just have a stored procedure that recreates the table by combining the 20 tables.
Alternatively, you could find that you have large individual tables (millions of rows). In this case, you would want to treat each of the original tables as a partition.
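A minimal sketch of the view approach, run in SQLite through Python, shows how a single query by gene then returns the matching rows together with the table each one came from. The two small tables, their reduced column set and the second table's row are hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (Gene TEXT, Chr INT, Start INT);
    CREATE TABLE tableB (Gene TEXT, Chr INT, Start INT);
    INSERT INTO tableA VALUES ('ACTRT2', 1, 2939346), ('EIF4G3', 1, 21226201);
    INSERT INTO tableB VALUES ('ACTRT2', 1, 2939400);

    CREATE VIEW allpatients AS
        SELECT 'a' AS whichtable, t.* FROM tableA t
        UNION ALL
        SELECT 'b' AS whichtable, t.* FROM tableB t;
""")

# Query every patient table at once by gene name.
hits = conn.execute(
    "SELECT whichtable, Gene, Chr, Start FROM allpatients "
    "WHERE Gene = ? ORDER BY whichtable",
    ("ACTRT2",),
).fetchall()
for hit in hits:
    print(hit)
```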
If what you have is a bunch of Excel files, you can import them all into the same table, with a distinct column for patient id. There is no need to create 20 different tables for this -- in fact, it would be a bad idea.
Once you do, go to Access' query design, SQL view and use these queries:
To create a query that returns all fields for the input gene name:
select *
from gene_data
where gene = [GeneName]
To create a query that returns gene names that are mutated in at least 4 samples:
select gene
from
(select gene, sample_id
from gene_data
group by gene, sample_id) g
group by gene
having count(sample_id) >= 4
After this, change to design view -- you'll see how to create similar queries using the GUI.
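The single-table approach can be tried outside Access with a short SQLite sketch in Python. It takes the threshold as a parameter, following the question's requirement of genes mutated in at least N samples. The sample rows, the reduced column set and the helper genes_in_at_least are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene_data (sample_id TEXT, gene TEXT, start INT)")
conn.executemany(
    "INSERT INTO gene_data VALUES (?, ?, ?)",
    [
        ("p1", "ACTRT2", 100), ("p1", "ACTRT2", 200),  # two sites, same sample
        ("p2", "ACTRT2", 100), ("p3", "ACTRT2", 300),
        ("p1", "CSMD2", 500), ("p2", "CSMD2", 500),
    ],
)


def genes_in_at_least(connection, min_samples):
    """Return the genes that are mutated in at least min_samples distinct samples."""
    sql = """
        SELECT gene
        FROM (SELECT gene, sample_id FROM gene_data GROUP BY gene, sample_id) g
        GROUP BY gene
        HAVING COUNT(sample_id) >= ?
        ORDER BY gene
    """
    return [row[0] for row in connection.execute(sql, (min_samples,))]


print(genes_in_at_least(conn, 3))
print(genes_in_at_least(conn, 2))
```

The inner GROUP BY collapses multiple mutated sites of one gene within a sample into a single row, so the outer count is a count of samples rather than of sites.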