What component can be used to duplicate every row of an Excel file using Talend?

If I have an Excel file with rows like this:
val1 | val2 | val3 | val4
val5 | val6 | val7 | val8
then I need the result to be this:
val1 | val2 | val3 | val4
val1 | val2 | val3 | val4
val5 | val6 | val7 | val8
val5 | val6 | val7 | val8
Is this possible with Talend?
EDIT: Notice the order of the rows. I need them to maintain order.

For a pure duplication, the easiest way is to use a tHashOutput to store the values coming from your Excel file.
Then you can read from two linked tHashInput components and join the flows with a tUnite.
If you need to keep the order, you can add a tJavaRow or a tMap before the tHashOutput to add a column "order" populated by a sequence.
Then you can add a tSortRow after the tUnite and order by the new column.
Finally, you delete the extra column with a tFilterColumn (or any other component).
Result:
Code for the order column:
Numeric.sequence("s1",1,1);
Note: you might have to add the tHashOutput and tHashInput components to your palette, as they are not included by default.
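If it helps to see the logic outside of Talend, here is a minimal plain-Scala sketch of the same idea (the row values are taken from the question; the sequence tag, the union and the stable sort stand in for the "order" column, the tUnite and the tSortRow/tFilterColumn steps):
object DuplicateRows extends App {
  // Rows from the question.
  val rows = Seq(
    Seq("val1", "val2", "val3", "val4"),
    Seq("val5", "val6", "val7", "val8"))

  val tagged     = rows.zipWithIndex                  // the "order" column (Numeric.sequence)
  val duplicated = tagged ++ tagged                   // the tUnite of the two identical flows
  val reordered  = duplicated.sortBy(_._2).map(_._1)  // tSortRow, then tFilterColumn drops the tag

  reordered.foreach(r => println(r.mkString(" | ")))
}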

Send 2 identical inputs to a tUnite to duplicate the rows, then send the rows to a tSort to sort them.
The 2 tFlowInput components are identical; replace them with what you have.
Sync Columns on the tUnite.
Set the columns to sort on in the tSort.
Output:
.---------+----------+----------+----------.
|                tLogRow_1                 |
|=--------+----------+----------+---------=|
|newColumn|newColumn1|newColumn2|newColumn3|
|=--------+----------+----------+---------=|
|val1     |val2      |val3      |val4      |
|val1     |val2      |val3      |val4      |
|val5     |val6      |val7      |val8      |
|val5     |val6      |val7      |val8      |
'---------+----------+----------+----------'

Related

How to get the minimum of three column values in PostgreSQL

The common function to get the minimum value of a column is min(column), but what I want to do is get the minimum value of a row, based on the values of 3 columns. For example, using the following base table:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
|    2 |    1 |    3 |
|   10 |    0 |    1 |
|   13 |   12 |    2 |
+------+------+------+
I want to query it as:
+-----------+
| min_value |
+-----------+
|         1 |
|         0 |
|         2 |
+-----------+
I found a solution as follows, but it is for another SQL dialect, not PostgreSQL, so I am not getting it to work in PostgreSQL:
select
  (
    select min(minCol)
    from (values (t.col1), (t.col2), (t.col3)) as minCol(minCol)
  ) as minCol
from t
I could write something using a CASE statement, but I would like to write a query like the above for PostgreSQL. Is this possible?
You can use least() (and greatest() for the maximum)
select least(col1, col2, col3) as min_value
from the_table

Spark Scala: finding a value in another dataframe

Hello, I'm fairly new to Spark and I need help with this little exercise. I want to find certain values in another dataframe, but if those values aren't present, I want to reduce the length of each value until I find a match. I have these dataframes:
----------------
|values_to_find|
----------------
|         ABCDE|
|         CBDEA|
|         ACDEA|
|         EACBA|
----------------
--------------
| list  | Id |
--------------
| EAC   | 1  |
| ACDE  | 2  |
| CBDEA | 3  |
| ABC   | 4  |
--------------
And I expect the next output:
-------------------------------
| Id | list  | values_to_find |
-------------------------------
| 4  | ABC   | ABCDE          |
| 3  | CBDEA | CBDEA          |
| 2  | ACDE  | ACDEA          |
| 1  | EAC   | EACBA          |
-------------------------------
For example, ABCDE isn't present, so I reduce its length by one (ABCD); again it doesn't match anything, so I reduce it again and this time I get ABC, which matches, so I use that value to join and form a new dataframe. There is no need to worry about duplicate values when reducing the length, but I need to find the exact match. Also, I would like to avoid using a UDF if possible.
I'm using a foreach to get every value in the first dataframe, and I can do a substring there (if there is no match), but I'm not sure how to look up these values in the 2nd dataframe. What's the best way to do it? I've seen tons of UDFs that could do the trick, but I want to avoid that as stated before.
df1.foreach { values_to_find =>
  df1.get(0).toString.substring(0, 4)
}
Edit: Those dataframes are examples; I have many more values. The solution should be dynamic: iterate over some values and find their match in another dataframe, with the catch that I need to reduce their length if they are not present.
Thanks for the help!
You can register the dataframes as temporary views and write the logic in SQL. Are you implementing this scenario in Spark for the first time, or did you already implement it in a legacy system before Spark? With Spark you have the freedom to write a UDF in Scala or to use SQL. Sorry, I don't have a solution handy, so I'm just giving a pointer.
The following will help you.
val dataDF1 = Seq((4,"ABC"),(3,"CBDEA"),(2,"ACDE"),(1,"EAC")).toDF("Id","list")
val dataDF2 = Seq(("ABCDE"),("CBDEA"),("ACDEA"),("EACBA")).toDF("compare")
dataDF1.createOrReplaceTempView("table1")
dataDF2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 on table1.list like concat('%',SUBSTRING(table2.compare,1,3),'%')").show()
Output:
+---+-----+-------+
| Id| list|compare|
+---+-----+-------+
| 4| ABC| ABCDE|
| 3|CBDEA| CBDEA|
| 2| ACDE| ACDEA|
| 1| EAC| EACBA|
+---+-----+-------+
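Note that this SQL hard-codes a 3-character prefix (SUBSTRING(table2.compare,1,3)) and a LIKE containment check, which happens to fit the sample rows. If you need the "keep shortening until an exact match" behaviour from the question without a UDF, one possible sketch (my own assumption, and it requires Spark 2.4+ for the sequence/transform higher-order functions) is to explode every prefix of each value, join on exact equality, and keep the longest prefix that matched:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val spark = SparkSession.builder.appName("prefix-match").getOrCreate()
import spark.implicits._

// Example data from the question.
val valuesDF = Seq("ABCDE", "CBDEA", "ACDEA", "EACBA").toDF("values_to_find")
val listDF   = Seq((1, "EAC"), (2, "ACDE"), (3, "CBDEA"), (4, "ABC")).toDF("Id", "list")

// One row per (value, prefix), longest prefix first.
val withPrefixes = valuesDF.withColumn(
  "prefix",
  explode(expr("transform(sequence(length(values_to_find), 1, -1), n -> substring(values_to_find, 1, n))")))

// Exact-equality join, then keep only the longest prefix that matched.
val byLongestMatch = Window.partitionBy("values_to_find").orderBy(length($"prefix").desc)
val result = withPrefixes
  .join(listDF, $"prefix" === $"list")
  .withColumn("rn", row_number().over(byLongestMatch))
  .filter($"rn" === 1)
  .select("Id", "list", "values_to_find")

result.show()
For the sample rows this produces the same four matches as the output shown above.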

Fast split Spark dataframe by keys in some column and save as different dataframes

I have a very big Spark 2.3 dataframe like this:
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AB      | 2    | 1    |
| AA      | 2    | 3    |
| AC      | 1    | 2    |
| AA      | 3    | 2    |
| AC      | 5    | 3    |
-------------------------
I need to "split" this dataframe by the values in the col_key column and save each split part in a separate CSV file, so I have to get smaller dataframes like
-------------------------
| col_key | col1 | col2 |
-------------------------
| AA      | 1    | 2    |
| AA      | 2    | 3    |
| AA      | 3    | 2    |
-------------------------
and
-------------------------
| col_key | col1 | col2 |
-------------------------
| AC      | 1    | 2    |
| AC      | 5    | 3    |
-------------------------
and so on.
I need to save each resulting dataframe as a different CSV file.
The number of keys is not big (20-30), but the total amount of data is (~200 million records).
I have a solution where every part of the data is selected in a loop and then saved to a file:
val keysList = df.select("col_key").distinct().map(r => r.getString(0)).collect.toList
keysList.foreach(k => {
  val dfi = df.where($"col_key" === lit(k))
  SaveDataByKey(dfi, path_to_save)
})
It works correctly, but the bad thing about this solution is that every selection of data by key causes a full pass through the whole dataframe, and it takes too much time.
I think there must be a faster solution, where we pass through the dataframe only once and put every record into the "right" result dataframe (or directly into a separate file) along the way. But I don't know how to do it :)
Maybe someone has ideas about it?
Also, I prefer to use Spark's DataFrame API because it provides the fastest way of processing data (so using RDDs is not desirable, if possible).
You need to partition by the column and save as CSV files; each partition is written out separately.
yourDF
.write
.partitionBy("col_key")
.csv("/path/to/save")
Why don't you try this?
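To make that concrete, a minimal sketch (the input path, the header option and the repartition call are my additions, not part of the answer): partitionBy writes the whole dataframe in a single pass and produces one sub-directory per key, e.g. col_key=AA/, each containing part files rather than one file literally named after the key.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("split-by-key").getOrCreate()

// Placeholder input; use your real dataframe instead.
val df = spark.read.option("header", "true").csv("/path/to/input")

df
  .repartition(col("col_key"))   // group each key into the same task to limit small files
  .write
  .partitionBy("col_key")        // one directory per distinct col_key, single pass over the data
  .option("header", "true")
  .mode("overwrite")
  .csv("/path/to/save")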

DB2 add column, insert data and new id

Each month, I want to record meter readings in order to see trends over time, and also want to add any new meters to my history table. I would like to add a new column name each month based on date.
I know how to concatenate data in a query, but have not found a way to do the same thing when adding a column. If today is 06/14/2018, I want the column name to be Y18M06, as I plan to run this monthly.
Something like this to add the column (this doesn't work)
ALTER TABLE METER.HIST
ADD COLUMN ('Y' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'YY') CONCAT 'M' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'MM'))
DECIMAL(12,5) NOT NULL DEFAULT 0
Then, I want to insert data into that new column from another table. In this case, a list of meter id's, and the new column contains a meter reading. If a new id exists, then it also needs to be added.
Source: CURRENT
+----+---------+
| id | reading |
+----+---------+
| 1  | 321.234 |
| 2  | 422.634 |
| 3  | 121.456 |
+----+---------+
Destination: HISTORY (current)
+----+---------+
| id | Y18M05  |
+----+---------+
| 1  | 121.102 |
| 2  | 121.102 |
+----+---------+
Destination: HISTORY (desired)
+----+---------+---------+
| id | Y18M05  | Y18M06  |
+----+---------+---------+
| 1  | 121.102 | 321.234 |
| 2  | 121.102 | 422.634 |
| 3  |         | 121.456 |
+----+---------+---------+
Any help would be much appreciated!
Don't physically add columns. Rather, pivot the data on the fly:
https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/pivoting_tables56?lang=en
Adding columns is not a good idea. From a conceptual and modelling point of view, think about adding rows for each month instead. You have a limited number of columns but a more or less unlimited number of rows, and this will give you a permanent model / table structure.

LibreOffice - RANDBETWEEN to return a name

I have a two-column list like this:
+----+-------+
| Nr | Name  |
+----+-------+
| 1  | Alice |
| 2  | Bob   |
| 3  | Joe   |
| 4  | Ann   |
| 5  | Jane  |
+----+-------+
And I would like to generate a random name from this list.
For now, I am only able to randomly select a number and then manually pick out the corresponding name, using this function: =RANDBETWEEN(A2;A10). How can I pick out the name instead?
Assuming that the data of your table are in cells E7:F11 the following code can do what you need:
=VLOOKUP(RANDBETWEEN(1;5);E7:F11;2)
Further, in case you need to create a random permutation of the names you may also use the Calc extension Permutate at https://sourceforge.net/projects/permutate/.
Hope that helps.
Assuming your data is laid out with Nr starting in A1, I suggest:
=INDEX(B$2:B$6;RANDBETWEEN(1;5))
then there is no need for the Nr column in making the selection.