Redshift: merge rows and conflict-resolve by timestamp

This is unlike the usual "select the row with the latest timestamp" question and is specific to Redshift.
I want to allow users to update parts of a (staging) table row at different points in time while avoiding UPDATE statements. Instead we take an append-only approach: we keep adding rows in which only the unique id and a timestamp are mandatory, and the other columns may or may not have a value provided.
Question:
Given a table where, apart from a "primary key" (not truly enforced) and a timestamp column, all other columns are nullable, how do I merge all rows that share the same primary key into one row, picking for each nullable column the most recent non-null value, if one exists?
Example:
| id | timestamp | status | stringcol | numcol |
|----|-----------|--------|-----------|--------|
| 1  | 456       | begin  |           |        |
| 1  | 460       |        |           | 2      |
| 2  | 523       |        | foo       |        |
| 1  | 599       | mid    | blah      |        |
| 2  | 624       | begin  |           |        |
| 1  | 721       | done   |           | 60     |
should produce
| id | timestamp | status | stringcol | numcol |
|----|-----------|--------|-----------|--------|
| 2  | 624       | begin  | foo       |        |
| 1  | 721       | done   | blah      | 60     |

This can be achieved with Redshift's LISTAGG function combined with the SPLIT_PART function.
LISTAGG concatenates all non-null values in a group into a single string, optionally letting you order the concatenation and choose a delimiter.
SPLIT_PART splits a string by a delimiter and returns the chosen part.
Using the example 5-column table above, you would need something like this:
SELECT id,
       MAX(last_updated) AS last_updated,  -- last_updated is the timestamp column from the example
       -- LISTAGG skips NULLs, so ordering DESC by last_updated and taking the
       -- first part yields the most recent non-null value for each column
       SPLIT_PART(LISTAGG(status,    ',') WITHIN GROUP (ORDER BY last_updated DESC), ',', 1) AS status,
       SPLIT_PART(LISTAGG(stringcol, ',') WITHIN GROUP (ORDER BY last_updated DESC), ',', 1) AS stringcol,
       SPLIT_PART(LISTAGG(numcol,    ',') WITHIN GROUP (ORDER BY last_updated DESC), ',', 1) AS numcol  -- SPLIT_PART returns a string; cast back to numeric if needed
FROM staging_table
GROUP BY 1;
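Note that this relies on the values never containing the delimiter (a comma here). If that cannot be guaranteed, a hedged alternative sketch is to use Redshift's IGNORE NULLS option on LAST_VALUE instead (column names as in the example; staging_table is a placeholder):
-- Pick the most recent non-null value per column with LAST_VALUE ... IGNORE NULLS,
-- then collapse to one row per id with DISTINCT.
SELECT DISTINCT
       id,
       MAX(last_updated) OVER (PARTITION BY id) AS last_updated,
       LAST_VALUE(status IGNORE NULLS) OVER (
           PARTITION BY id ORDER BY last_updated
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS status,
       LAST_VALUE(stringcol IGNORE NULLS) OVER (
           PARTITION BY id ORDER BY last_updated
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS stringcol,
       LAST_VALUE(numcol IGNORE NULLS) OVER (
           PARTITION BY id ORDER BY last_updated
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS numcol
FROM staging_table;
Unlike the LISTAGG approach, this also keeps numcol in its original numeric type.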

Related

Handling ksqlDB v0.11 composite key (tables) to replicate in MySQL using JDBC Sink connector

I'm using ksqlDB version 0.11 (I cannot upgrade to newer versions at the moment) and want to replicate a TABLE's data into MySQL using the JDBC Sink connector. ksqlDB v0.11 does not support multiple TABLE keys, and my data needs to be grouped using multiple GROUP BY expressions.
Using this statement I create the table:
CREATE TABLE estads AS SELECT
STID AS stid,
ASIG AS asig,
COUNT(*) AS np,
MIN(NOTA) AS min,
MAX(NOTA) AS max,
AVG(NOTA) AS med,
LATEST_BY_OFFSET(FECHREG) AS fechreg
FROM estads_stm GROUP BY stid, asig EMIT CHANGES;
The resulting table has the following schema:
Name : ESTADS
Field | Type
---------------------------------------------
KSQL_COL_0 | VARCHAR(STRING) (primary key)
NP | BIGINT
MIN | DOUBLE
MAX | DOUBLE
MED | DOUBLE
FECHREG | VARCHAR(STRING)
As you can see, the two primary keys (stid and asig) have been merged into a field called KSQL_COL_0, which is the expected behavior for version 0.11. The problem is that I need to use the JDBC Sink connector to replicate the data into a MySQL table with the following schema:
+---------+--------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+-------------------+-----------------------------+
| stid | varchar(15) | NO | PRI | NULL | |
| asig | varchar(10) | NO | PRI | NULL | |
| np | smallint(6) | YES | | NULL | |
| min | decimal(5,2) | YES | | NULL | |
| max | decimal(5,2) | YES | | NULL | |
| med | decimal(5,2) | YES | | NULL | |
| fechreg | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
+---------+--------------+------+-----+-------------------+-----------------------------+
I don't know how to "unmerge" the automatically generated KSQL_COL_0 in order to tell JDBC that both stid and asig are primary keys in the MySQL table. Any ideas how to manage this? I know that since ksqlDB version 0.15 this is no longer a problem, as ksqlDB tables support multiple keys, but as I said, upgrading is not an option in my case.
Thanks!
I figured it out.
Basically, you need to use the AS_VALUE() function in the table creation query. This way you copy the values of both primary key columns into new value columns, while the newly created composite primary key still has its own column. Then simply configure the JDBC Sink connector to pick up all the columns except the newly created primary key.
CREATE TABLE estads AS SELECT
STID AS k1,
ASIG AS k2,
AS_VALUE(STID) AS stid,
AS_VALUE(ASIG) AS asig,
COUNT(*) AS np,
MIN(NOTA) AS min,
MAX(NOTA) AS max,
AVG(NOTA) AS med,
LATEST_BY_OFFSET(FECHREG) AS fechreg
FROM estads_stm GROUP BY k1, k2 EMIT CHANGES;
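For reference, a hedged sketch of what the JDBC Sink connector configuration could look like with this approach. The topic name, connection URL, database name and insert mode are illustrative assumptions, not taken from the question; only the pk.fields / fields.whitelist idea comes from the answer above:
# Hedged example connector config (topic, URL and database name are assumptions)
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
topics=ESTADS
connection.url=jdbc:mysql://mysql-host:3306/school
insert.mode=upsert
# use the copied value columns as the MySQL primary key
pk.mode=record_value
pk.fields=stid,asig
# only these value fields go to MySQL; the merged ksqlDB key column is not needed
fields.whitelist=stid,asig,np,min,max,med,fechreg
auto.create=false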

DB2 add column, insert data and new id

Each month, I want to record meter readings in order to see trends over time, and also want to add any new meters to my history table. I would like to add a new column name each month based on date.
I know how to concatenate data in a query, but have not found a way to do the same thing when adding a column. If today is 06/14/2018, I want the column name to be Y18M06, as I plan to run this monthly.
Something like this to add the column (this doesn't work)
ALTER TABLE METER.HIST
ADD COLUMN ('Y' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'YY') CONCAT 'M' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'MM'))
DECIMAL(12,5) NOT NULL DEFAULT 0
Then, I want to insert data into that new column from another table. In this case, a list of meter id's, and the new column contains a meter reading. If a new id exists, then it also needs to be added.
Source table CURRENT:
+----+---------+
| id | reading |
+----+---------+
| 1  | 321.234 |
| 2  | 422.634 |
| 3  | 121.456 |
+----+---------+
Destination table HISTORY (current):
+----+---------+
| id | Y18M05  |
+----+---------+
| 1  | 121.102 |
| 2  | 121.102 |
+----+---------+
Destination table HISTORY (desired after this month's run):
+----+---------+---------+
| id | Y18M05  | Y18M06  |
+----+---------+---------+
| 1  | 121.102 | 321.234 |
| 2  | 121.102 | 422.634 |
| 3  |         | 121.456 |
+----+---------+---------+
Any help would be much appreciated!
Don't physically add columns. Rather, pivot the data on the fly:
https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/pivoting_tables56?lang=en
Adding columns is not a good idea. From a conceptual and modelling point of view, think about adding rows for each month instead. You have a limited number of columns but a more or less unlimited number of rows, and this gives you a permanent model / table structure; a sketch of that approach follows below.
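To make the row-per-month suggestion concrete, here is a minimal, hedged sketch. The table and column names (METER.HIST_ROWS, METER.CURRENT_READINGS, METER_ID, PERIOD) are assumptions for illustration, and the period expression reuses the VARCHAR_FORMAT idea from the question:
-- Assumed history table: one row per meter per month instead of one column per month
CREATE TABLE METER.HIST_ROWS (
    METER_ID INTEGER       NOT NULL,
    PERIOD   CHAR(6)       NOT NULL,            -- e.g. 'Y18M06'
    READING  DECIMAL(12,5) NOT NULL DEFAULT 0,
    PRIMARY KEY (METER_ID, PERIOD)
);

-- Monthly load: new meters appear automatically, because every meter in the
-- source table simply gets a new (METER_ID, PERIOD) row
INSERT INTO METER.HIST_ROWS (METER_ID, PERIOD, READING)
SELECT id,
       'Y' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'YY')
           CONCAT 'M' CONCAT VARCHAR_FORMAT(CURRENT TIMESTAMP, 'MM'),
       reading
FROM METER.CURRENT_READINGS;
The column-per-month view from the question can then be produced on demand with a pivot query like the one in the linked article.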

Use column values to build up query

I have a table log containing the columns schema_name, table_name, object_id and data, and the table can contain records with different table_names and schema_names:
| schema_name | table_name | object_id | data  |
|-------------|------------|-----------|-------|
| bio         | sample     | 5         | jsonb |
| bio         | location   | 8         | jsonb |
| ...         | ...        | ...       | jsonb |
I want to execute a query as followed:
select schema_name,
table_name,
object_id,
(select some_column from schema_name.table_name where id = object_id)
from log
PS: id is a column that exists in every table (sample, location, ...)
Is there a way in PostgreSQL to use the values in these columns to build up a query (so that schema_name and table_name are filled in based on the column values)?
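Plain SQL cannot parameterize identifiers, so one hedged sketch (assuming PL/pgSQL is acceptable) is a helper function that builds the query dynamically. The function name lookup_value, the text return type and the bigint id type are assumptions for illustration; some_column is the column named in the question:
-- Hypothetical helper: look up some_column in the table named by each log row
CREATE OR REPLACE FUNCTION lookup_value(p_schema text, p_table text, p_id bigint)
RETURNS text
LANGUAGE plpgsql AS $$
DECLARE
    result text;
BEGIN
    -- %I safely quotes the identifiers taken from the log row
    EXECUTE format('SELECT some_column::text FROM %I.%I WHERE id = $1',
                   p_schema, p_table)
    INTO result
    USING p_id;
    RETURN result;
END;
$$;

SELECT schema_name,
       table_name,
       object_id,
       lookup_value(schema_name, table_name, object_id) AS some_column
FROM log;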

Make row data into column headers while grouping

I'm trying to group up multiple rows of similar data and convert the differentiating row data into columns on Amazon Redshift. It's easier to explain with an example:
Starting Table
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| x    | y    | A    | 123  |
| x    | y    | B    | 456  |
+------+------+------+------+
End result desired
+------+------+-----+-----+
| Col1 | Col2 | A   | B   |
+------+------+-----+-----+
| x    | y    | 123 | 456 |
+------+------+-----+-----+
Essentially I'm grouping by Columns 1 and 2; the entries in Column 3 become the new column headers, and the entries in Column 4 become the values in those new columns.
Any help super appreciated!
There is no native functionality, but you could do something like:
SELECT
COL1,
COL2,
MAX(CASE WHEN COL3='A' THEN COL4 END) AS A,
MAX(CASE WHEN COL3='B' THEN COL4 END) AS B
FROM table
GROUP BY COL1, COL2
You effectively need to hard-code the column names. It's not possible to automatically define columns based on the data.
This is standard SQL - nothing specific to Amazon Redshift.

Join column with timestamps where value is maximum

I have a table that looks like
+-------+-----------+
| value | timestamp |
+-------+-----------+
and I'm trying to build a query that gives a result like
+-------+-----------+------------+------------------------+
| value | timestamp | MAX(value) | timestamp of max value |
+-------+-----------+------------+------------------------+
so that the result looks like
+---+----------+---+----------+
| 1 | 1.2.1001 | 3 | 1.1.1000 |
| 2 | 5.5.1021 | 3 | 1.1.1000 |
| 3 | 1.1.1000 | 3 | 1.1.1000 |
+---+----------+---+----------+
but I got stuck on joining the column with the corresponding timestamps.
Any hints or suggestions?
Thanks in advance!
For further information (if that helps):
In the real project the max-values are grouped by month and day (with group by clause, which works btw), but somehow I got stuck on joining the timestamps for max-values.
EDIT
Cross joins are a good idea, but I want to have them grouped by month e.g.:
+---+----------+---+----------+
| 1 | 1.1.1101 | 6 | 1.1.1300 |
| 2 | 2.6.1021 | 5 | 5.6.1000 |
| 3 | 1.1.1200 | 6 | 1.1.1300 |
| 4 | 1.1.1040 | 6 | 1.1.1300 |
| 5 | 5.6.1000 | 5 | 5.6.1000 |
| 6 | 1.1.1300 | 6 | 1.1.1300 |
+---+----------+---+----------+
EDIT 2
I've added a fiddle with some sample data and an example of the current query.
http://sqlfiddle.com/#!1/efa42/1
How to add the corresponding timestamp to the maximum?
Try a cross join with two subqueries: the first one selects all records, the second one returns the single row that represents the timestamp of the max value, <3;"1000-01-01"> for example.
SELECT col_value,
       col_timestamp,
       max_col_value,
       col_timestamp_of_max_value
FROM table1
CROSS JOIN
(
    SELECT MAX(col_value) AS max_col_value,
           col_timestamp  AS col_timestamp_of_max_value
    FROM table1
    GROUP BY col_timestamp
    ORDER BY max_col_value DESC
    LIMIT 1
) A -- one row that represents the timestamp of the max value, i.e. <3;"1000-01-01">
Use a window function, since you're on Postgres:
SELECT *,
       MAX(value) OVER () AS max_value,
       FIRST_VALUE(timestamp) OVER (ORDER BY value DESC) AS timestamp_of_max_value
FROM mytable;
That gives you the maximum value, together with its timestamp, on every row.
http://www.postgresql.org/docs/9.1/static/tutorial-window.html
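For the month-grouped requirement from the EDIT, a hedged variant of the same window-function idea partitions by month. The names col_value, col_timestamp and table1 are assumed (matching the cross-join answer above), and the PARTITION BY expression may need to include the day or year depending on the real grouping:
-- Per calendar month (across years, as in the EDIT's sample): the maximum value
-- and its timestamp, repeated on every row of that month
SELECT col_value,
       col_timestamp,
       MAX(col_value) OVER (PARTITION BY EXTRACT(MONTH FROM col_timestamp)) AS max_col_value,
       FIRST_VALUE(col_timestamp) OVER (
           PARTITION BY EXTRACT(MONTH FROM col_timestamp)
           ORDER BY col_value DESC
       ) AS col_timestamp_of_max_value
FROM table1;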