I have yearly data in my Excel file in this format:
Country \ Years   1980   1981   ...   2010
Abkhazia           234    334   ...    456
Afghanistan        466    789   ...    732
...
I want to transform my data into 3 different tables and load them into a Postgres database.
The tables should look something like this:
First table - country:
id | name
1 | Abkhazia
2 | Afghanistan
Second table - dates:
id | date
1 | 1980
2 | 1981
And the third is a table where all the data is stored, keyed by country and date:
country_id date_id data
1 1 234
1 2 334
2 1 466
2 2 789
... ... ...
Any ideas how I could achieve my goal?
Assuming the source Excel structure is as below (I have custom-built this):
There are basically 3 parts to your question. I'll break the transformation down into parts for better understanding:
1. Loading Table - Country
This is pretty straightforward based on the data given in the Excel file. Simply take an
Excel Input step >> add a Sequence step, with the sequence name Country ID >> select only the Country Name and Country ID >> load into the country table using a Table Output step.
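If you want to sanity-check the result outside PDI, here is a minimal pandas sketch of the same step (the file name data.xlsx and the "Country" column header are assumptions, not from the original post):

import pandas as pd

# Excel Input: the wide sheet, one row per country, one column per year.
df = pd.read_excel("data.xlsx")  # hypothetical file name

# Add sequence (Country ID), then select only the ID and the country name.
country = pd.DataFrame({
    "id": range(1, len(df) + 1),   # mirrors the Add sequence step
    "name": df["Country"],         # assumes the first column is headed "Country"
})
print(country)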
2. Loading Table - Year:
The idea here is to get the year IDs in a row-wise format instead of the columns given in the Excel source data. PDI version 5 and above provides a very useful step called Metadata Structure. This step allows you to get the structure of your input. In this case, we need the year columns pulled, ignoring the country column.
Follow the steps as below:
Read the Excel data >> get the Metadata Structure of your source >> filter out the country column (which is available in the row at position=1) >> add a Sequence step and name it YearID >> finally load the year table.
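Again as a pandas sketch for comparison (same assumed data.xlsx): the column names themselves play the role of the Metadata Structure output.

import pandas as pd

df = pd.read_excel("data.xlsx")  # hypothetical file name

# Metadata Structure + filter: take the column names, drop the country
# column sitting at position 1, then add the YearID sequence.
year_cols = list(df.columns[1:])
dates = pd.DataFrame({
    "id": range(1, len(year_cols) + 1),
    "date": year_cols,
})
print(dates)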
3. Loading the Final Table - Country and Year along with Data:
The way to bring all the column data values down to row level in PDI is the Row Normalizer step. Use this step to produce a normalized output, then follow the steps below:
Read the Excel source data >> use the Row Normalizer step to normalize the rows based on the years >> do a Stream Lookup against the country and year tables above to fetch the CountryID and YearID respectively >> finally load the necessary columns into a Table Output step.
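And the last part as a pandas sketch, including the load into Postgres (data.xlsx, the "Country" header, and the connection URL are all placeholders):

import pandas as pd
from sqlalchemy import create_engine

df = pd.read_excel("data.xlsx")  # hypothetical file name
year_cols = [c for c in df.columns if c != "Country"]

country = pd.DataFrame({"id": range(1, len(df) + 1), "name": df["Country"]})
dates = pd.DataFrame({"id": range(1, len(year_cols) + 1), "date": year_cols})

# Row Normalizer equivalent: unpivot the year columns into rows.
fact = df.melt(id_vars="Country", value_vars=year_cols,
               var_name="date", value_name="data")

# Stream Lookup equivalent: swap names and years for surrogate keys.
fact = (fact.merge(country, left_on="Country", right_on="name")
            .merge(dates, on="date", suffixes=("_country", "_date")))
fact = fact.rename(columns={"id_country": "country_id",
                            "id_date": "date_id"})[["country_id", "date_id", "data"]]

# Table Output equivalent: load all three tables into Postgres.
engine = create_engine("postgresql://user:pass@localhost/mydb")  # placeholder URL
country.to_sql("country", engine, index=False, if_exists="replace")
dates.to_sql("dates", engine, index=False, if_exists="replace")
fact.to_sql("country_date_data", engine, index=False, if_exists="replace")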
Hope it helps :)
I have placed the code in a GitHub repo along with the data file which I have used. It's here.
Also, I just realized that I have used the wrong naming conventions as per your question: consider date_id as YearID, and instead of plain ids I have used countryid and yearid.
I am using the Talend free version and have the below requirement:
My source is MS Access, table SRC_CUST.
SRC_CUST
CUST_ID   CUST_NAME
101       ABC
102       LMN
My target is a .csv file, TGT_CUST.
Requirement: I am using the tAccessInput component for the MS Access table, and I want to load that table into the .csv file. My columns vary from day to day.
Day 1: SRC_CUST has 2 columns, CUST_ID and CUST_NAME, so I need to load them as-is into the .csv file.
Day 2: SRC_CUST has 3 columns, CUST_ID, CUST_NAME and CUST_ADD, so on day 2 I need to load these 3 columns without changing any code; that is, I need to handle the column change dynamically.
Note: I am using the Talend free version, so I can use neither a dynamic component nor the dynamic data type. I cannot even pre-add the columns in "Edit schema" under the basic settings of the tAccessInput component, because my columns keep changing.
Please help me with this.
Thanks,
Vaishali Shinde
I am not sure what the technical term for what I am trying to do is.
Hopefully the raw data and output below will clearly define the use case.
Raw data:
This is what my raw data looks like.
Output 1:
This is what I am trying to extract first.
Here I am trying to get a table where the first column has the names of the guests and the second column has the count of times they have featured in the table as a guest.
Output 2:
This is what I am trying to extract next.
Here I am trying to map months against names and see how many nights each guest has accumulated in each month.
One way to achieve this would be to create a temp table with 5 columns:
column 1 with the guest names,
column 2 with the count of occurrences in the Guest 1 column of the raw data table,
column 3 with the count of occurrences in the Guest 2 column,
column 4 with the count of occurrences in the Guest 3 column,
column 5 with the total of the previous 3 columns.
But I am trying to find a proper solution through Tableau, if possible, because this approach would not help me achieve Output 2.
Plain text raw data if you'd like to work on it:
booking by,Guest 1,Guest 2,Guest 3,stay start,stay end,hotel code
Ram,Seema,Ram,,May 1 2018,May 2 2018,BBST
Karan,Ram,Seema,,May 6 2018,May 7 2018,BRRLY
Mahesh,Mahesh,Seema,Ram,June 2 2018,June 4 2018,BBST
Krishna,Krishna,,,June 2 2018,June 3 2018,BRRLY
Seema,Seema,,,June 7 2018,June 8 2018,BRRLY
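For reference, the temp-table idea above is essentially an unpivot plus a count. Here is that logic sketched in pandas rather than Tableau, using the CSV above, and assuming nights means stay end minus stay start:

import io
import pandas as pd

raw = """booking by,Guest 1,Guest 2,Guest 3,stay start,stay end,hotel code
Ram,Seema,Ram,,May 1 2018,May 2 2018,BBST
Karan,Ram,Seema,,May 6 2018,May 7 2018,BRRLY
Mahesh,Mahesh,Seema,Ram,June 2 2018,June 4 2018,BBST
Krishna,Krishna,,,June 2 2018,June 3 2018,BRRLY
Seema,Seema,,,June 7 2018,June 8 2018,BRRLY"""
df = pd.read_csv(io.StringIO(raw))
df["stay start"] = pd.to_datetime(df["stay start"])
df["stay end"] = pd.to_datetime(df["stay end"])

# Unpivot the three guest columns into a single guest column.
long = df.melt(id_vars=["stay start", "stay end"],
               value_vars=["Guest 1", "Guest 2", "Guest 3"],
               value_name="guest").dropna(subset=["guest"])

# Output 1: how many times each person features as a guest.
print(long["guest"].value_counts())

# Output 2: nights per guest per month (assumed: nights = stay length in days).
long["nights"] = (long["stay end"] - long["stay start"]).dt.days
long["month"] = long["stay start"].dt.month_name()
print(long.pivot_table(index="guest", columns="month",
                       values="nights", aggfunc="sum", fill_value=0))

In Tableau itself, the equivalent of the melt above is selecting the three guest columns in the data source pane and choosing Pivot.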
I have this data in Tableau:
KPI_NAME  Value  Date
---------------------
A         2      1-Jan
B         4      1-Jan
A         6      2-Jan
B         7      2-Jan
and I want it like this:
A  B  Date
----------
2  4  1-Jan
6  7  2-Jan
So I want to convert each distinct value in the KPI_NAME column into a separate column. This can be done in the visualization part of Tableau, but I want to do it in data preparation, because I want to use the result in a calculated field.
Any help is appreciated.
Most Tableau functionality is designed to consume more granular, flattened, tidy data in the form of your first set. As such, the data prep functionality has a feature to unpivot column values into rows. I don't believe the reverse functionality is built into the data prep capability in the same way.
Not knowing your end use case, a potential workaround would be to:
Create a calculated field with an IF statement to return the value
when the record is listed as A, and otherwise return NULL.
Although you will still have the same number of records, you should be able to perform many of the available calculations with this type of data structure.
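The same idea sketched in pandas (in Tableau the calculated field would be something like IF [KPI_NAME] = "A" THEN [Value] END):

import pandas as pd

df = pd.DataFrame({"KPI_NAME": ["A", "B", "A", "B"],
                   "Value": [2, 4, 6, 7],
                   "Date": ["1-Jan", "1-Jan", "2-Jan", "2-Jan"]})

# One column per KPI: keep Value where KPI_NAME matches, else NULL (NaN).
df["A"] = df["Value"].where(df["KPI_NAME"] == "A")
df["B"] = df["Value"].where(df["KPI_NAME"] == "B")
print(df)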
Alternatively, you could perform your pivot outside of Tableau.
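For instance, a minimal pandas sketch of that outside-Tableau pivot, with the column names taken from your example:

import pandas as pd

df = pd.DataFrame({"KPI_NAME": ["A", "B", "A", "B"],
                   "Value": [2, 4, 6, 7],
                   "Date": ["1-Jan", "1-Jan", "2-Jan", "2-Jan"]})

# Each distinct KPI_NAME becomes its own column, one row per Date.
wide = df.pivot(index="Date", columns="KPI_NAME", values="Value").reset_index()
print(wide)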
I have three data sets:
The first, called education.dta, contains individuals (students) over many years with their attained education from 1990-2000. Originally it is in wide format, but I can easily reshape it to long. It is presented as wide below:
id  educ_90  educ_91  ...  educ_00  cohort
1   0        1             1        87
2   1        1             2        75
3   0        0             2        90
The second, called graduate.dta, contains information on when individuals (students) finished high school. However, this data set does not contain several years, only a "snapshot" of the individual when they finish high school, plus characteristics of the individual students such as background (for example, parents' occupation).
id  schoolid  county  cohort  ...
1   11        123     87
2   11        123     75
3   22        243     90
The third data set is called teachers.dta. It contains information about all teachers at the high schools, such as their education, whether they work full or part time, gender, and so on. This data set is long.
id  schoolid  county  year  education
22  11        123     2011  1
21  11        123     2001  1
23  22        243     2015  3
Now I want to merge these three data sets.
First, I want to merge education.dta and graduate.dta on id.
Problem when education.dta is wide: I manage to merge education.dta and graduate.dta. Then I run a loop so that each variable from graduate.dta takes the same value over all years, for example:
forv j=1990/2000 {
    gen county`j' = .
    replace county`j' = county
}
However, afterwards, when reshaping to long, Stata reports that variable id does not uniquely identify the observations.
Further, I have tried to first reshape education.dta to long, and thereafter merge either 1:m or m:1 with education as the master, using graduate.dta.
However, Stata again reports that id is not unique. How do I deal with this?
In the next step I want to merge the above with teachers.dta on schoolid.
I want my final dataset in long format.
Thanks for your help :)
I am not certain that I have understood the exact format of your data; it would be helpful if you gave us a toy dataset to look at using dataex (and that could even help you figure out the problem yourself!).
But to start: because you are seeing that id is not unique, you need to figure out why there might be multiple rows per id in any of the datasets. Can someone in graduate.dta or education.dta appear more than once? help duplicates will probably be useful for exploring the data in this way.
Because you want your dataset in long format, I suggest reshaping education.dta to long first, then doing something like merge m:1 id using "graduate.dta" (once you figure out why some observations show up more than once), and then finally something like merge 1:1 schoolid year using "teachers.dta", and you will have your final dataset.
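If it helps to see the same logic outside Stata, here is a rough pandas sketch under the file layouts described above (pandas' validate="m:1" raises the same complaint Stata gives when id is not unique on the using side):

import pandas as pd

# education.dta reshaped to long: one row per id and year.
education = pd.read_stata("education.dta")
edu_long = education.melt(id_vars=["id", "cohort"],
                          value_vars=[c for c in education.columns
                                      if c.startswith("educ_")],
                          var_name="year", value_name="educ")

# merge m:1 id: many id-year rows match one row per id in graduate.dta.
graduate = pd.read_stata("graduate.dta").drop(columns="cohort")
merged = edu_long.merge(graduate, on="id", validate="m:1")

# Then bring in teachers.dta on schoolid (the suffix avoids clashes on
# columns present in both, such as id and county).
teachers = pd.read_stata("teachers.dta")
final = merged.merge(teachers, on="schoolid", how="left",
                     suffixes=("", "_teacher"))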
I'm new to KDB (sorry if this question is dumb). I'm creating the following table:
q)dsPricing:([id:`int$(); date:`date$()] open:`float$();close:`float$();high:`float$();low:`float$();volume:`int$())
q)`dsPricing insert(123;2003.03.23;1.0;3.0;4.0;2.0;1000)
q)`dsPricing insert(123;2003.03.24;1.0;3.0;4.0;2.0;2000)
q)save `:dsPricing
Let's say that after saving I exit. After starting q again, I'd like to add another pricing item without loading the entire file, because the file could be large:
q)`dsPricing insert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
I've been looking at .Q.dpft, but I can't really figure it out. Also, this table/file doesn't need to be partitioned.
Thanks
You can upsert with the file handle of a table to append on disk; your example would look like this:
`:dsPricing upsert(123;2003.03.25;1.0;3.0;4.0;2.0;1500)
You can load the table into your q session using get, load or \l
q)get `:dsPricing
id date | open close high low volume
--------------| --------------------------
123 2003.03.23| 1 3 4 2 1000
123 2003.03.24| 1 3 4 2 2000
123 2003.03.25| 1 3 4 2 1500
.Q.dpft will save a table splayed (one file for each column in the table, plus a .d file containing the column names) with a parted attribute (p#) on one of the symbol columns. Any symbol columns will also be enumerated by .Q.en.