How to append columns to other columns in Talend?

I have two hash inputs, each with completely different columns. Say hashInput_1 has columns called
One | Two | Three | Four | Five
and hashInput_2 has columns called:
Six | Seven | Eight
Each hash input has the same number of rows. I just need to combine them into one Excel document or flow so that the columns and all of the data get correctly joined. I know Talend can append rows, but I need to append columns so that the new schema is One | Two | Three | Four | Five | Six | Seven | Eight, with all of the data in the appropriate columns.

You need to add one more column, ID, to the schema of each hashInput.
In the lookup flow, increment this column with a sequence (in a tJavaRow component).
In the tMap, increment another sequence and join on the two ID columns.
That way the first row of one hashInput (ID 1) joins to the first row of the other hashInput (ID 1), as sketched below.
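If it helps to see the idea outside of Talend, the tMap join is conceptually the same as this SQL sketch (the table and column names are only illustrative, nothing here is generated by Talend):
SELECT a.One, a.Two, a.Three, a.Four, a.Five,
       b.Six, b.Seven, b.Eight
FROM (SELECT *, ROW_NUMBER() OVER () AS id FROM hash_input_1) a
JOIN (SELECT *, ROW_NUMBER() OVER () AS id FROM hash_input_2) b
  ON a.id = b.id;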

Related

Postgres aggregate function for selecting best / first element in a group

I have a table of elements with a number of items that need de-duping based on a priority. The following is a grossly simplified but representative example:
sophia=> select * from numbers order by value, priority;
 value | priority | label
-------+----------+-------
     1 |        1 | One
     1 |        2 | Eins
     2 |        1 | Two
     2 |        2 | Zwei
     3 |        2 | Drei
     4 |        1 | Four
     4 |        2 | Vier
(7 rows)
I want to restrict this to returning only a single row per number. Easy enough, I can use the first() aggregate function detailed in https://wiki.postgresql.org/wiki/First/last_(aggregate)
sophia=> select value, first(label) from numbers group by value order by value;
 value | first
-------+-------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
The problem with this is that the order isn't well defined so if the DB rows were inserted in a different order, I might get this:
sophia=> select value, first(label) from numbers group by value order by value;
 value | first
-------+-------
     1 | Eins
     2 | Zwei
     3 | Drei
     4 | Vier
(4 rows)
Of course, the solution to that also seems simple, in that I could just do an order by:
sophia=> select value, first(label) from (select * from numbers order by priority) foo group by value order by value;
 value | first
-------+-------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
However, the problem here is that the query optimizer is free to discard the ORDER BY in subqueries, meaning that this doesn't always work and breaks in random, nasty places.
I have a solution that I'm currently using in a handful of places that relies on array_agg.
sophia=> select value, (array_agg(label order by priority))[1] as best_label from numbers group by value;
 value | best_label
-------+------------
     1 | One
     2 | Two
     3 | Drei
     4 | Four
(4 rows)
sophia=>
This provides robust ordering, but it involves creating a bunch of extra arrays at query time that are immediately thrown away, so performance on larger datasets rather sucks.
So the question is, is there a better, cleaner, faster way of dealing with this?
Your last attempt includes the answer to your question, you just didn't realise it:
array_agg(label order by priority)
Note the order by clause inside the aggregate function. This isn't special to array_agg, but is a general part of the syntax for using aggregate functions:
Ordinarily, the input rows are fed to the aggregate function in an unspecified order. In many cases this does not matter; for example, min produces the same result no matter what order it receives the inputs in. However, some aggregate functions (such as array_agg and string_agg) produce results that depend on the ordering of the input rows. When using such an aggregate, the optional order_by_clause can be used to specify the desired ordering. The order_by_clause has the same syntax as for a query-level ORDER BY clause, as described in Section 7.5, except that its expressions are always just expressions and cannot be output-column names or numbers.
Thus the solution to your problem is simply to put an order by inside your first aggregate expression:
select value, first(label order by priority) from numbers group by value order by value;
Given how elegant this is, I'm surprised that first and last are still not implemented as built-in aggregates.
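For reference, the wiki's first() aggregate boils down to something like this (a sketch; see the linked wiki page for the canonical definition):
CREATE OR REPLACE FUNCTION first_agg(anyelement, anyelement)
RETURNS anyelement
LANGUAGE sql IMMUTABLE STRICT AS 'SELECT $1';

CREATE AGGREGATE first (anyelement) (
    SFUNC = first_agg,
    STYPE = anyelement
);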
The Postgres SELECT statement has a clause called DISTINCT ON, which is extremely useful when you want to return one row per group. In this case, you would use:
SELECT DISTINCT ON (value) value, label
FROM numbers
ORDER BY value, priority;
Using DISTINCT ON is generally faster than other methods involving groups or window functions.

Does PostgreSQL have a way of creating metadata about the data in a particular table?

I'm dealing with a lot of unique data that has the same type of columns, but each group of rows has different attributes about it. I'm trying to see if PostgreSQL has a way of storing metadata about groups of rows in a database, or if I would be better off adding custom columns to my current list of columns to track these different attributes. Microsoft Excel, for instance, lets you merge multiple columns into a super-column to group them into one, but I don't know how this would translate to a PostgreSQL database. Thoughts, anyone?
Right, can't upload files. Hope this turns out well.
Section 1 | Section 2 | Section 3
=================================
Num1|Num2 | Num1|Num2 | Num1|Num2
=================================
132 | 163 | 334 | 1345| 343 | 433
......
......
......
To have a "super group" of columns (in SQL in general, not just PostgreSQL), the easiest approach is to use multiple tables.
Example:
Person table can have columns of
person_ID, first_name, last_name
employee table can have columns of
person_id, department, manager_person_id, salary
customer table can have columns of
person_id, addr, city, state, zip
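A minimal DDL sketch of that layout (the types and constraints are assumptions, just to make the idea concrete):
CREATE TABLE person (
    person_id  int PRIMARY KEY,
    first_name text,
    last_name  text
);

CREATE TABLE employee (
    person_id         int PRIMARY KEY REFERENCES person(person_id),
    department        text,
    manager_person_id int REFERENCES person(person_id),
    salary            numeric(12,2)
);

CREATE TABLE customer (
    person_id int PRIMARY KEY REFERENCES person(person_id),
    addr      text,
    city      text,
    state     text,
    zip       text
);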
That way, you can join them together to do whatever you like.
Example:
select *
from person p
left outer join customer c on c.person_id=p.person_id
left outer join employee e on e.person_id=p.person_id
Or any variation, while separating the data into different types and perhaps saving a little disk space in the process (for example, if most "people" are "customers", they don't need a bunch of employee data floating around or a pile of nullable columns).
That's how I normally handle this type of situation, but without a practical example, it's hard to say what's best in your scenario.

Versioning in the database

I want to store the full version history of a row every time an update is made, for an amount-sensitive table.
So far, I have decided to use the following approach.
Do not allow updates.
Every time an update is made, create a new entry in the table.
However, I am undecided on what is the best database structure design for this change.
Current Structure
Primary Key: id
id(int) | amount(decimal) | other_columns
First Approach
Composite Primary Key: id, version
id(int) | version(int) | amount(decimal) | change_reason
1 | 1 | 100 |
1 | 2 | 20 | correction
Second Approach
Primary Key: id
Uniqueness Index on [origin_id, version]
id(int) | origin_id(int) | version(int) | amount(decimal) | change_reason
1 | NULL | 1 | 100 | NULL
2 | 1 | 2 | 20 | correction
I would suggest a new table which stores a unique id for each item. This serves as a lookup table for all available items.
item Table:
id(int)
1000
For the table which stores all changes for an item, let's call it the item_changes table. item_id is a FOREIGN KEY to the item table's id. The relationship between the item table and the item_changes table is one-to-many.
item_changes Table:
id(int) | item_id(int) | version(int) | amount(decimal) | change_reason
1 | 1000 | 1 | 100 | NULL
2 | 1000 | 2 | 20 | correction
With this, item_id will never be NULL, as it is a valid FOREIGN KEY to the item table.
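A rough sketch of this design (types are assumptions):
CREATE TABLE item (
    id int PRIMARY KEY
);

CREATE TABLE item_changes (
    id            int PRIMARY KEY,
    item_id       int NOT NULL REFERENCES item(id),
    version       int NOT NULL,
    amount        decimal(12,2),
    change_reason text,
    UNIQUE (item_id, version)
);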
The best method is to use Version Normal Form (vnf). Here is an answer I gave for a neat way to track all changes to specific fields of specific tables.
The static table contains the static data, such as PK and other attributes which do not change over the life of the entity or such changes need not be tracked.
The version table contains all dynamic attributes that need to be tracked. The best design uses a view which joins the static table with the current version from the version table, as the current version is probably what your apps need most often. Triggers on the view maintain the static/versioned design without the app needing to know anything about it.
The link above also contains a link to a document which goes into much more detail including queries to get the current version or to "look back" at any version you need.
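A rough sketch of the static/version split described above (all names and types are illustrative, and the maintenance triggers are omitted):
CREATE TABLE item_static (
    id int PRIMARY KEY
    -- attributes that never change, or whose changes need not be tracked
);

CREATE TABLE item_version (
    id            int REFERENCES item_static(id),
    version       int,
    amount        decimal(12,2),
    change_reason text,
    PRIMARY KEY (id, version)
);

-- the view most application code reads: the current version of each row
CREATE VIEW item_current AS
SELECT v.*
FROM item_version v
JOIN (SELECT id, max(version) AS version
      FROM item_version
      GROUP BY id) cur USING (id, version);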
Why not go for SCD-2 (Slowly Changing Dimension type 2)? It is a methodology that describes a standard solution to exactly this problem. Here is the advantage of SCD-2 and an example of using it; it is a standard design pattern for this kind of database.
Type 2 - Creating a new additional record. In this methodology, all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key(or other durable identifiers). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with the current indicator set to 'Y'. For 'effective date' columns, i.e. start_date, and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.
id | amount | start_date  | end_date    | current_flag
---+--------+-------------+-------------+-------------
 1 |    100 | 01-Apr-2018 | 02-Apr-2018 | N
 2 |     80 | 04-Apr-2018 | NULL        | Y
Detailed explanation:
Here, all you need is to add the three extra columns START_DATE, END_DATE and CURRENT_FLAG to track your records properly. When the record is first inserted at the source, the table will store the value as:
id | amount | start_date  | end_date    | current_flag
---+--------+-------------+-------------+-------------
 1 |    100 | 01-Apr-2018 | NULL        | Y
And when the same record is updated, you set the END_DATE of the previous record to the current system date and its CURRENT_FLAG to 'N', and insert the new record as below. That way you can track everything about your records:
id | amount | start_date  | end_date    | current_flag
---+--------+-------------+-------------+-------------
 1 |    100 | 01-Apr-2018 | 02-Apr-2018 | N
 2 |     80 | 04-Apr-2018 | NULL        | Y
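As a sketch, the SCD-2 "update" is really two statements (the table name amounts and the literal values here are made up to match the example above):
-- close out the current version of the record being changed
UPDATE amounts
SET end_date     = CURRENT_DATE,
    current_flag = 'N'
WHERE id = 1
  AND current_flag = 'Y';

-- insert the new version as the current record
INSERT INTO amounts (id, amount, start_date, end_date, current_flag)
VALUES (2, 80, CURRENT_DATE, NULL, 'Y');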

Using filtered results as field for calculated field in Tableau

I have a table that looks like this:
+------------+-----------+---------------+
| Invoice_ID | Charge_ID | Charge_Amount |
+------------+-----------+---------------+
|          1 | A         |           $10 |
|          1 | B         |           $20 |
|          2 | A         |           $10 |
|          2 | B         |           $20 |
|          2 | C         |           $30 |
|          3 | C         |           $30 |
|          3 | D         |           $40 |
+------------+-----------+---------------+
In Tableau, how can I have a field that SUMs the Charge_Amount for the Charge_IDs B, C and D, where the invoice has a Charge_ID of A? The result would be $70.
My datasource is SQL Server, so I was thinking that I could add a field (called Has_ChargeID_A) to the SQL Server Table that tells if the invoice has a Charge_ID of A, and then in Tableau just do a SUM of all the rows where Has_ChargeID_A is true and Charge_ID is either B, C or D. But I would prefer if I can do this directly in Tableau (not this exactly, but anything that will get me to the same result).
Your intuition is steering you in the right direction. You do want to filter to only Invoices that contain row with a Charge_ID of A, and you can do this directly in Tableau.
First place Invoice_ID on the filter shelf, then select the Condition tab for the filter. Then select the "By formula" option on the condition tab and enter the formula you wish to use to determine which invoice_ids are included by the filter.
Here is a formula for your example:
count(if Charge_ID = 'A' then 'Y' end) > 0
For each data row, it will calculate the value of the expression inside the parenthesis, and then only include invoice_ids with at least one non-null value for the internal expression. (The implicit else for the if statement, "returns" null).
The condition tab for a dimension field equates to a HAVING clause in SQL.
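For example, against the sample table the whole thing is roughly equivalent to this SQL sketch (the table name charges is made up, and Charge_Amount is assumed to be numeric):
SELECT Invoice_ID,
       SUM(CASE WHEN Charge_ID IN ('B', 'C', 'D') THEN Charge_Amount END) AS bcd_total
FROM charges
GROUP BY Invoice_ID
HAVING COUNT(CASE WHEN Charge_ID = 'A' THEN 1 END) > 0;
Invoices 1 and 2 pass the HAVING condition and contribute $20 and $50, which is the $70 from the question.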
If condition formulas get complex, it's often a good idea to define them with a calculated field -- or a combination of several simpler calculated fields, just to keep things manageable.
Finally, if you end up working with sets of dimensions like this frequently, you can define them as sets. You can still drop sets on the filter shelf, but you can then reuse them in other ways: testing set membership in a calculated field (like a SQL IN clause), or creating new sets using intersection and union operators. You can think of sets as named filters, such as the set of invoices that contain a type A charge.

Can I set column attributes for a kdb partitioned table?

Is it possible to set column attributes for a partitioned table?
q)h "update `g#ticker from `pmd"
'par
q)h "update `s#ts from `pmd"
'par
q)
Should I set the attributes on the memory table, before I run the partitioning? Will the attributes be preserved after the partitioning?
Take a look at the setattrcol function in dbmaint.q. This script is very useful when working with partitioned databases.
In order for the partitions on disk to be sorted, you need to iterate through the partitions and use xasc. For a single partition (assuming a quote table partitioned by date that you want sorted by `timestamp):
`timestamp xasc `$":./2014.04.20/quote/"
Repeat this for each date partition (or wrap it in a function and apply it over the partition list with each). Once you've finished sorting each partition, the s attribute will appear on the timestamp column:
q)meta quote
c        | t f a
---------| -----
date     | d
timestamp| p   s
pair     | s
side     | c
...