Is it possible to combine multiple input files with different schemas using Schema Drift / Dynamic Columns - azure-data-factory

I have around 20 tab-separated input files. They have in the region of 500 columns, but each will be slightly different.
The sink output schema is known and will contain all the possible input columns.
As a simplified example:
File 1
Name | Age | DOB        | Nationality
---------------------------------------
Bob  | 21  | 01/01/1972 | British

File 2
Name | Nationality | NINO
----------------------------
Joe  | British     | AA995654A

File 3
Name | DOB        | Nationality
---------------------------------
Sam  | 01/01/1990 | British
Is it possible to have one DataFlow with multiple inputs, where the schema is not known until runtime, that would cope with changes in the input files and in this case would output:
Name | Age  | DOB        | NINO      | Nationality
----------------------------------------------------
Bob  | 21   | 01/01/1972 | NULL      | British
Joe  | NULL | NULL       | AA995654A | British
Sam  | NULL | 01/01/1990 | NULL      | British
I have looked at column pattern matching and schema drift, but don't see how/if it is possible to achieve this.

What you can do is build a logical model in your data flow using a Derived Column transformation that carries the common model you wish to conform your input data to. This video shows an example of achieving this: https://www.youtube.com/watch?v=K5tgzLjEE9Q
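As a rough sketch (byName() is the standard way to reference drifted columns in a Data Flow expression; the specific casts here are my assumption): enable "Allow schema drift" on each source, then add a Derived Column that projects every column of the logical model. byName() returns NULL when the incoming file lacks that column, which produces exactly the NULL-padded output shown above.

Name = toString(byName('Name'))
Age = toInteger(byName('Age'))
DOB = toString(byName('DOB'))
NINO = toString(byName('NINO'))
Nationality = toString(byName('Nationality'))

A Union transformation can then combine the conformed streams from the 20 sources ahead of the sink.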

Related

Atomic values / divisibility to reach 1NF

After reading about normalization I am unsure how to interpret the 1NF requirements.
According to Wikipedia, something is in first normal form if the "domain of each attribute contains only atomic, indivisible values".
My question is: Who decides what is indivisible or not?
You may divide a date datatype into year, month, day, seconds, nanoseconds. You may as well divide an address into its exact coordinates. When can you really be sure that you have reached 1NF?
Would this table be considered 1NF?
fullName      | fullAddress
---------------------------------------------------------------
Joe Zowesson  | 87th Victoria Street London EC96 1MB, 14584
Mason Hamburg | 47th Jeremy Street London EC26 1MB, 13584
Dedrik Terry  | 27th Burger Street London EC16 1MB, 17584
My interpretation here is that the value Joe Zowesson is indivisible in regard to the column fullName, and that zip code, street number, and street name are all atomic in relation to the column fullAddress.
I am almost certain that I am in the wrong, but I can not yet understand why.
The question is in regard to an upcoming exam, where I will need to "prove" which normal form something is currently in, which I find very hard depending on how you interpret the word atomic.
Basically, you have misunderstood the concept of 1NF. By atomic value, it is meant that when you have a column for Name, you should not store any other values alongside it. In other words, the column intended for Name should not store ID, Address, or anything else together with Name, so that when you query the column Name you get only Name, and not Name with ID or Address. And Name can be in any form you want, whether it be First name + Last name or First name + Last name + Middle name + Previous name.
The decision of whether you need separate columns for the related data should be made during design. Let's suppose you have table Student:
StudentId | FullName      | Address         | Average grade
-------------------------------------------------------------
1         | John Done     | New York, US    | 3.4
2         | Robert Bored  | New York, US    | 0
3         | Student LName | Dallas, US      | 1
4         | Another LName | Munich, Germany | 2
In this case, it means you do not write queries that need First name and Last name separately; you always need the full name at once, for example:
SELECT FullName
FROM Student
WHERE StudentId = 1;
John Done
And when you need First name, Last name separately, you decompose them into several columns, for example:
StudentId | FirstName | LastName | Address         | Average grade
--------------------------------------------------------------------
1         | John      | Done     | New York, US    | 3.4
2         | Robert    | Bored    | New York, US    | 0
3         | Student   | LName    | Dallas, US      | 1
4         | Another   | LName    | Munich, Germany | 2
And your queries might look like this:
SELECT LastName, AverageGrade
FROM Student
WHERE AverageGrade >= 1 AND FirstName != 'John';
The result will be:
| LastName | AverageGrade |
---------------------------
| LName | 1 |
| LName | 2 |
Or something like this maybe:
UPDATE Student
SET AverageGrade = 4
WHERE LastName = 'LName' AND FirstName != 'Student'
Basically, the decision depends on how you manipulate the data and in which form you need it.
To sum it up: whether the relation is in 1NF or not depends on what values you're trying to store in the table. As mentioned above, one column should store only one kind of value, e.g. ID, Address, Name, etc. And the decision of how your columns' values will look depends on the design and how you need to store the data. If you do not need to query first name, middle name, last name, and second name separately, then you can just save all of them in one column FullName and it will still be in 1NF. But if you need them separately, you can store them in separate columns, and again it will still be in 1NF, though it might violate other rules.
Here is a tutorial you might find useful: https://www.studytonight.com/dbms/first-normal-form.php
Let the application, and how it will be used, guide you as to what data should be split further into additional fields (or not).
For example:
If, in your application, you are constantly splitting first name from last name so that you can say "Hi Joe" in correspondence, you should split fullName into two fields. Conversely, if you had two fields firstName and lastName and were always concatenating them so that you could correctly address an envelope, it would make more sense to store them in a single column in your table.
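As a sketch of both access patterns in standard SQL (the customer table and column names are illustrative, not from the question):

-- Separate columns: concatenate when addressing an envelope
SELECT firstName || ' ' || lastName AS fullName
FROM customer;

-- Single column: splitting relies on a consistent 'First Last' format
SELECT SUBSTRING(fullName FROM 1 FOR POSITION(' ' IN fullName) - 1) AS firstName
FROM customer;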
In practice, it is not uncommon for a database to show some de-normalization in the above example, given how common both scenarios are. The risk is that the fields get out of sync if someone updates firstName (for example) but doesn't update fullName.
Consider things like how you will force your users to follow a certain pattern if you decide to go with a single column fullName. How would you prevent "Smith, Joe" if your application needed "Joe Smith"?
Dates are another good example and again, whether you split the parts into separate columns depends on how they will be used.
A datetime field which indicates when a row was inserted probably doesn't need to be split out, but if you had many queries which were only interested in the year (for example), it might make sense to split it out.
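For instance, a sketch of the year-only query that might motivate splitting (orders and inserted_at are illustrative names):

-- Standard SQL; frequent year-only filters may justify a separate year column
SELECT COUNT(*)
FROM orders
WHERE EXTRACT(YEAR FROM inserted_at) = 2015;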
This only scratches the surface, which is why this answer is more about how to think about the underlying problem. Yes, normalizing your database is important for all kinds of reasons, but how far you go with it depends, at the end of the day, on how your data will be used.

Is this table in first normal form?

I am currently studying SQL normal forms.
Let's say I have the following table; the primary key is userid:
userid | FirstName | LastName | Phone
----------------------------------------
1      | John      | Smith    | 555-555
1      | Tim       | Jack     | 432-213
2      | Sarah     | Mit      | 454-541
3      | Tom       | jones    | 987-125
The book I'm reading states the following conditions must be true in order for a table to be in 1st normal form.
1. Rows contain data about an entity.
2. Columns contain data about attributes of the entities.
3. All entries in a column are of the same kind.
4. Each column has a unique name.
5. Cells of the table hold a single value.
6. The order of the columns is unimportant.
7. The order of the rows is unimportant.
8. No two rows may be identical.
9. A primary key must be assigned.
I'm not sure if my table violates the 8th rule, "No two rows may be identical," because the first two records in my table

1 | John | Smith | 555-555
1 | Tim  | Jack  | 432-213

share the same userid. Does that mean they are considered duplicate rows?
Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?

1 | John | Smith | 555-555
1 | John | Smith | 555-555
EDIT1: Sorry for the confusion
The question I was trying to ask is simple:
Is this table below in 1st normal form?
userid | FirstName | LastName | Phone
----------------------------------------
1      | John      | Smith    | 555-555
1      | Tim       | Jack     | 432-213
2      | Sarah     | Mit      | 454-541
3      | Tom       | jones    | 987-125
Based on the 9 rules given in the textbook I think it is, but I wasn't sure whether rule 8, "No two rows may be identical," was being violated by the two records that use the same primary key.
The class textbook and prof aren't really that clear on this subject, which is why I am asking this question.
Or do duplicate records mean that every piece of data in the row has to be the same for the record to be considered a duplicate row, as in the example below?
They mean the latter of your choices: entire rows are what must be "identical". It's OK if two rows share the same values for one or more columns, as long as one or more columns differ.
That's because a relation holds a set of values that are tuples/rows/records, and a set is a collection of values that are all different.
But SQL and some relational algebras have different notions of "identical" in the case of NULLs compared to the relational model without NULLs. You should read what your textbook says about it if you want to know exactly what it means: two rows that have NULL in the same column are considered different. (Point 9 might be summarizing something involving NULLs; it depends on the explanation in the book.)
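As an illustrative check in SQL (the table name users is hypothetical, since the question's table is unnamed): two rows count as duplicates only when every column matches.

SELECT userid, FirstName, LastName, Phone, COUNT(*)
FROM users
GROUP BY userid, FirstName, LastName, Phone
HAVING COUNT(*) > 1;

Against the question's data this returns nothing, since the two userid = 1 rows differ in the other columns.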
PS
There's no single notion of what a relation is. There is no single notion of "identical". There is no single notion of 1NF.
Points 3-8 are better described as (poor) ways of restricting how to interpret a picture of a table to get a relation. Your textbook seems to be (strangely) making "1NF" a property of such an interpretation of a picture of a table. Normally we simply define a relation to be a certain thing, so if you have one then it has to have the defined properties. Then "in 1NF" applies to a relation and either means "is a relation" and isn't further used, or it means certain further restrictions hold.

A relation is a set of tuples/rows/records. In the kind of relation your points 3-8 describe, the tuples are sets of attribute/column/field name-value pairs, and the values paired with a name have to be of the type paired with that name in some schema/heading, which is a set of name-type pairs defined either as part of the relation or external to it.
Your textbook doesn't seem to present things clearly. Its definition of "1NF" is also idiosyncratic in that although points 3-8 are mathematical, 1 and 2 are informal/heuristic (and 9 could be either or both).

Logic to convert string of words to number

I am looking for logic that will help me convert a string to a number in Teradata and Hive.
It should be easily implementable in Teradata, as I don't have permission to deploy a UDF in TD. In Hive, if it is not simple, I can easily write a UDF.
My requirement: let's say I have columns sender_country and receiver_country. I want to generate a number for concat(sender_country, '_', receiver_country).
The number should always be the same whenever the same pair of countries appears again.
Below is an illustration:
UID | sender_country | receiver_country | concat | number
-----------------------------------------------------------
1   | US             | UK               | US_UK  | 198760
2   | FR             | IN               | FR_IN  | 146785
3   | CH             | RU               | CH_RU  | 467892
4   | US             | UK               | US_UK  | 198760
It should work in a way where all unique combinations of countries get unique values. Like in the above example, US_UK is repeated and has the same corresponding number.
I tried hashbucket(hashrow('concat')) in TD, but don't know its equivalent implementation in Hive.
Similarly, we have the hash() function in Hive, but there is no equivalent function in TD.
I could not find any hash functions that return the same values in both TD and Hive.
You can simply convert each character into a number (this assumes the two-letter country codes shown in your data):
Ascii(Substr(sender_country,1,1))*1000000+
Ascii(Substr(sender_country,2,1))*10000+
Ascii(Substr(receiver_country,1,1))*100+
Ascii(Substr(receiver_country,2,1))
returns 85838575 for US, UK.
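Since ascii() and substr() both exist in Hive, the same arithmetic should port over unchanged; a sketch (transfers is a hypothetical table holding your columns):

SELECT uid,
       ascii(substr(sender_country, 1, 1)) * 1000000
     + ascii(substr(sender_country, 2, 1)) * 10000
     + ascii(substr(receiver_country, 1, 1)) * 100
     + ascii(substr(receiver_country, 2, 1)) AS pair_number
FROM transfers;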

Pentaho spoon transformation from excel file

I have yearly data in my Excel file in the following format:
Country \ Years | 1980 | 1981 | ... | 2010
---------------------------------------------
Abkhazia        | 234  | 334  | ... | 456
Afghanistan     | 466  | 789  | ... | 732
...
And I want to transform my data into 3 different tables and load them into a Postgres database.
The tables should look something like this:
First table - country:
id | name
1 | Abkhazia
2 | Afghanistan
Second table - dates:
id | date
1 | 1980
2 | 1981
And third is a table where all data is stored depending on country and date:
country_id | date_id | data
-----------------------------
1          | 1       | 234
1          | 2       | 334
2          | 1       | 466
2          | 2       | 789
...        | ...     | ...
Any ideas how I could achieve my goal?
Assuming the source Excel structure is as in the question (I have custom-built this):
There are basically 3 parts to your question. I'll break the transformation down into parts for better understanding:
1. Loading Table - Country
This is pretty straightforward based on the data given in the Excel file. Simply take an
Excel Input step >> add a Sequence step, giving the sequence the name Country ID >> select only Country Name and Country ID >> load into the Country table using a Table Output step.
2. Loading Table - Year:
The idea here is to get the year values in row-wise format instead of the columns given in the Excel source data. PDI version 5 and above provides a very useful step called Metadata Structure. This step allows you to get the structure of your table. In this case, we need the year columns pulled out, ignoring the country column.
Follow the steps as below:
Read the Excel data >> get the metadata structure of your source >> filter out the Country column (which is available in a row at position = 1) >> add a sequence number, naming it YearID >> finally load the Year table.
3. Loading the Final Table - Country and Year along with Data:
The way to bring all the column data values down to row level in PDI is the Row Normalizer step. Use this step to produce a normalized output. Now follow the steps below:
Read the Excel source data >> use the Row Normalizer step to normalize the rows based on the years >> do a Stream Lookup against the Country and Year tables above to fetch CountryID and YearID respectively >> finally load the necessary columns into the final table using Table Output.
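For reference, the three target tables from the question could be created in Postgres with DDL along these lines (a sketch; names follow the question, types are my assumption):

CREATE TABLE country (
    id   serial PRIMARY KEY,
    name text NOT NULL
);

CREATE TABLE dates (
    id   serial PRIMARY KEY,
    date integer NOT NULL  -- the year, e.g. 1980
);

CREATE TABLE country_data (
    country_id integer REFERENCES country (id),
    date_id    integer REFERENCES dates (id),
    data       integer
);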
Hope it helps :)
I have placed the code in a GitHub repo along with the data file I used. It's here.
Also, I just realized that I used the wrong naming conventions relative to your question: consider date_id as YearID, and instead of id I used countryid and yearid.

Counting multiple values from one column in Tableau

I have a field from the data I am reading in that can contain multiple values. They are essentially tags.
For example, there could be a column called "persons responsible". This could read "Joe; Bob; Sue" or "Sue" for a given row.
Is it possible from within Tableau to read these in as separate categories? So that for this sample data:
Project | Persons
---------------------------
Zeta | Bob; Sue; Joe
Enne | Sue
Doble Ve | Bob
There could be a count of Bob (2), Sue (2), Joe (1)?
I am working on getting better data inputs, but I was wondering if there was a temporary solution at this level.
I would definitely work towards normalizing your schema.
In the meantime, there is a workaround that is almost reasonable if there is a small set of possible values for the tags (persons in your example).
If Bob, Sue, and Joe are the only people in the system, you can use the CONTAINS() function to define a boolean calculated field for each person, e.g. Bob_Is_Responsible = CONTAINS([Persons], "Bob"), and similar fields for Sue and Joe. Then you could use those as building blocks, possibly with sets, to break the data up in different ways.
Of course, this approach gets cumbersome fast if the number of tags grows, or if it is unconstrained. But you asked for a temporary solution ...
If the number of elements is small, you can write and union several queries, each one selecting the project and the nth element.
Ideally, you'd reshape your data to look like this, either in the database or with the above-mentioned union technique (a SQL sketch follows the table below). Then you could count() or countd() the elements by project.
Project | Persons
---------------------------
Zeta | Bob
Zeta | Sue
Zeta | Joe
Enne | Sue
Doble Ve | Bob
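A sketch of that union technique in SQL, assuming the data sits in a hypothetical projects(project, persons) table and a Postgres-style split_part() function (one branch per tag position; branches whose position is empty are filtered out):

SELECT project, trim(split_part(persons, ';', 1)) AS person FROM projects
UNION ALL
SELECT project, trim(split_part(persons, ';', 2)) FROM projects
WHERE split_part(persons, ';', 2) <> ''
UNION ALL
SELECT project, trim(split_part(persons, ';', 3)) FROM projects
WHERE split_part(persons, ';', 3) <> '';

From that reshaped result, counting person grouped by person gives Bob (2), Sue (2), Joe (1).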