Talend shuffle the order of the columns - talend

I was trying to achieve merging all the rows of a file into columns based on a certain sequence number. This has been achieved by tpivotToColumnDelimited.( this has to be done , cannot be changed ).
But after using that, the column ordering has been changed.
Is there any way of reading a file according to a schema and writing the file according to some other schema in talend ? ( Basically shuffling the column ordering in a file )
I had tried using setting tdynamicschema from input and output but was not able to read and write the data properly.
Any help would be highly appreciated.

I had solved the issue.
Simply added a column which had the index number read from the file and before using the tpivotToColumnDelimited , i had used that column dynamically to sort the results and write to a tmp file and then with the help of tpivotToColumnDelimited , it is now according to the input schema.

Related

Hive - the correct way to permanently change the date and type in the entire column

I would be grateful if someone could explain here step by step what the process of changing the date format and column type from string to date should look like in the table imported via Hive View to HDP 2.6.5.
The data source is the well-known MovieLens 100K Dataset set ('u.item' file) from:
https://grouplens.org/datasets/movielens/100k/
$ hive --version is: 1.2.1000.2.6.5.0-292
Date format for the column is: '01-Jan-1995'
Data type of column is: 'string'
ACID Transactions is 'On'
Ultimately, I would like to convert permanently the data in the entire column to the correct Hive format 'yyyy-MM-dd' and next column type to 'Date'.
I have looked at over a dozen threads regarding similar questions before. Of course, the problem is not to display the column like this, it can be easily done using just:
SELECT from_unixtime(unix_timestamp(prod_date,'dd-MMM-yyyy'),'yyyy-MM-dd') FROM moviesnames;
The problem is to finally write it down this way. Unfortunately, this cannot be done via UPDATE in the following way, despite the inclusion of atomic operations in Hive config.
UPDATE moviesnames SET prodate = (select to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy'))) from moviesnames);
What's the easiest way to achieve the above using Hive-SQL? By copying and transforming a column or an entire table?
Try this:
UPDATE moviesnames SET prodate = to_date(from_unixtime(UNIX_TIMESTAMP(prod_date,'dd-MMM-yyyy')));

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading back from that file. However having to store data on an external file and reading back is messy and unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The solution to do the pivot has two parts: first, a tMap that transposes the value of each input-row in into the corresponding column in the output-row out and second, a tAggregate that groups the map's output-rows according to the brandname.
For the tMap you'd have to fill the columns conditionally like this, example for output colum named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregate you'd have to group by out.brandname and aggregate each column as sum ignoring nulls.

How to merge multiple output from single table using talend?

I have Employee table in database which is having gender column. so I want to Filter Employee data based on number of gender with three column to excel like:
I'm getting this output using below talend schemaStructure 1:
So I want to optimized above structure and trying in this way but I have been stuck with other scenario. Here I'm getting Employee data with gender wise but in three different file so Is there any way so that I can achieve same excel result from one SQL input file and after mapping can be get in a single output excel file?
Structure 2 :
NOTE: I don't want to use same input table many time. I want to get same output using single table and single output excel file. so please suggest me any component which one is useful for me.
Thanks in advance!!!

Redshift - Adding a column, do we have to change our previous CSVs to include it?

I currently have a redshift table in our database that has 10 columns, and I want to add another. It's trivial to do an alter table to do this.
My question - When I do this, will all my old CSV files fail to insert into redshift (via COPY from S3) given they won't have this new column?
I was hoping the columns would just be NULL vs. it failing on import, but I haven't seen any documentation on this.
Ideally I wish I could specify the actual column name in the header row of the CSV, but I haven't seen if that is possible anywhere.
FILLRECORD in COPY command does that: 'Allows data files to be loaded when contiguous columns are missing at the end of some of the records'.

OpenRowSet command in TSQL is returning NULLS

Been investigating for a while now and keep hitting a brick wall. I am importing from xls files into temp tables via the OpenRowset command. Now I have a problem where I’m trying to import a certain column has a range values but the most common are the following. Columns structured as long numbers i.e. 15598 and the some columns as strings i.e. 15598-E.
Now the openrowset is reading the string version no problem but is reporting the number version as a NULL. I read (http://www.sqldts.com/254.aspx ) that openrowset has that issue and the author speaks of implementing “HDR=YES;IMEX=1” into the query string but that’s not working for me at all.
Have any of you guys every encountered this?
Just some more info as well. I may not do this with the JET engine (Microsoft.Jet.OLEDB.4.0) so this is what my query looks like:
SELECT *
FROM
OPENROWSET('MSDASQL'
, 'Driver=Microsoft Excel Driver (*.xls);HDR=YES;IMEX=1;DBQ=C:\ImportFile.xls;'
, 'SELECT * FROM [Sheet1$]')
I notice you are using the Excel ODBC driver. Have you tried the JET OLEDB Provider with the equivalent connection string?
select * from openrowset(
'Microsoft.Jet.OLEDB.4.0',
'Data Source=C:\ImportFile.xls;Extended Properties="Excel 8.0;HDR=Yes;IMEX=1"',
'SELECT * FROM [Sheet1$]')
EDIT: Sorry, just noticed your last paragraph. Surely the Excel ODBC driver still goes via the JET engine, so what difference would it make?
EDIT: I have looked at the KB194124 link, and the registry values it recommends are the default values on my machine, which I have never changed. I have used the above method several times myself without problems. Maybe it's an environmental issue?
If you don't mind opening the file in Excel, take the columns that have the problem, select the column, and do
Data -> Text to Columns -> Next -> Next -> Text
Save the spreadsheet and they should all come in as Text in OPENROWSET
I've found using .CSV files instead of Excel, opened by setting up a Linked Server, and setting up the format of the files in schema.ini a more practical approach for handling imports like this, with that method you can explicitly choose each column's format.
We've come across the same issue. Unfortunately we've not found a solution either. There's more information here which indicates that there might be a registry fix.
I had the same problem. I fixed it cuting and pasting a row that contains a column with the string/numeric value (for example 123ABC) in the first row position of the sheet. For some reason T-SQL reads the first row and assumes that all the values are numeric.
Response by SqlACID in this link worked great [https://wikigurus.com/Article/Show/185717/OpenRowSet-command-in-TSQL-is-returning-NULLS] :-
If you don't mind opening the file in Excel, take the columns that have the problem, select the column, and do
Data -> Text to Columns -> Next -> Next -> Text
Save the spreadsheet and they should all come in as Text in OPENROWSET
I've found using .CSV files instead of Excel, opened by setting up a Linked Server, and setting up the format of the files in schema.ini a more practical approach for handling imports like this, with that method you can explicitly choose each column's format.