I'm using Confluent to write a query that gets the first timestamp in a 5-minute window of a Kafka topic. Here's the query (I know it's not the prettiest way to do it):
CREATE STREAM start_metric_value AS
select metric_value
FROM dataaggregaion
WINDOW TUMBLING (SIZE 5 MINUTE)
where metric_datetime_utc = MIN(TIMESTAMPTOSTRING(metric_datetime_utc, 'yyyy-MM-dd HH:mm:ss')) LIMIT 1;
but I have this error :
Code generation failed for Predicate: Can't find any functions with
the name 'MIN'. expression:(METRIC_DATETIME_UTC =
MIN(TIMESTAMPTOSTRING(METRIC_DATETIME_UTC, 'yyyy-MM-dd HH:mm:ss'))),
schema:ROWKEY STRING KEY, ID STRING, METRIC_NAME STRING,
METRIC_VALUE STRING, METRIC_DATETIME_UTC BIGINT, METRIC_INDEX
STRING, IANA_TIMEZONE STRING, PROCESSED_DATETIME_UTC BIGINT,
DATA_TYPE STRING, ASSET_TYPE STRING, ROWTIME BIGINT, ROWKEY
STRING Caused by: Can't find any functions with the name 'MIN'
Does anyone know how to solve this problem?
Not 100% clear what you're trying to achieve. See the comment above on your question about adding more details to help people understand what you're after.
That said, I can say....
The MIN function is not being recognised for two reasons:
You're passing the output of TIMESTAMPTOSTRING to MIN, but MIN does not take a string.
You can't use an aggregate function in a WHERE clause.
The error message you're seeing looks like a bug. If it still exists on the latest version of ksqlDB you may want to raise an issue in the ksqlDB GitHub project.
Even after correcting these two things, your query will still fail, as windowing in ksqlDB requires an aggregation, so you'll need a GROUP BY.
If, for example, you wanted to capture the min metric_datetime_utc per metric_value for each 5 minute window, you could do so with:
CREATE TABLE start_metric_value AS
SELECT
metric_value,
MIN(metric_datetime_utc) as minTs
FROM dataaggregaion
WINDOW TUMBLING (SIZE 5 MINUTE)
GROUP BY metric_value;
This will create a windowed table, i.e. a table where the key is made up of metric_value and the WINDOWSTART time. minTs will store the minimum datetime seen.
Let's run some data through the query to understand what's happening:
Input:
rowtime | metric_value | metric_datetime_utc
--------|---------------|--------------------
1 | A | 3
2 | A | 4
3 | A | 2
4 | B | 5
300000 | A | 6
Output to the START_METRIC_VALUE topic might be (Note: metric_value and windowStart will be stored in the Kafka record's key, while minTs will be in the value):
metric_value | windowStart | minTs
-------------|-------------|------
A | 0 | 3
A | 0 | 3
A | 0 | 2
B | 0 | 5
A | 300000 | 6
What is actually output to the topic will depend on your value of cache.max.bytes.buffering. Setting this to 0 (turning off buffering) will give the output above. With buffering enabled, some of the intermediate results may not be output to Kafka, though the final result for each window will remain the same. You can also control what is output to Kafka using the upcoming SUPPRESS functionality.
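To watch the windowed results from ksqlDB itself, including the window start, you could run a push query against the table, along these lines (a sketch assuming ksqlDB 0.10's WINDOWSTART pseudo-column; adjust for your version):
SELECT metric_value, WINDOWSTART, minTs
FROM start_metric_value
EMIT CHANGES;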
The above solution gives you the min timestamp per metric_value. If you want a global minimum datetime per window, then you can GROUP BY a constant. Note that this routes all events to a single ksqlDB node, so it doesn't scale well. If scaling is an issue there are workarounds, e.g. first calculating the minimum per metric_value, as above, and then post-processing that to find the global minimum.
CREATE TABLE start_metric_value AS
SELECT
1 as Key,
MIN(metric_datetime_utc) as minTs
FROM dataaggregaion
WINDOW TUMBLING (SIZE 5 MINUTE)
GROUP BY 1;
Note: syntax is correct for version 0.10 of ksqlDB. You may need to adjust for other versions.
I have a dataframe looking like this
some_data | date     | date_from | date_to
1234      | 1-2-2020 | 1-2-2020  | 2-2-2020
5678      | 2-2-2020 | 1-2-2020  | 2-3-2020
and I need to perform some operations on some_data based on time ranges that are different for every row, and stored in date_from and date_to. This is basically a rolling operation on some_data vs date, where the width of the window is not constant.
If the time ranges were the same, like always 7 days preceding/following, I would just do a window with rangeBetween. Any idea how I can still use rangeBetween with these variable ranges? I could really use the partitioning capability Window provides...
My current solution (sketched below) is:
a join of the table with itself to obtain a secondary/nested date column; at this point every date has the full list of possible dates
some WHERE clauses to select, for each primary date, the proper secondary dates according to date_from and date_to
a groupby on the primary date, with agg performing the actual operation on the selected rows
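For illustration only, that join/filter/group-by approach could look roughly like this in Spark SQL (the table name df and the SUM aggregation are assumptions; substitute whatever operation you actually perform):
SELECT a.date, a.date_from, a.date_to,
       SUM(b.some_data) AS some_data_agg
FROM df AS a
JOIN df AS b
  ON b.date BETWEEN a.date_from AND a.date_to
GROUP BY a.date, a.date_from, a.date_to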
But I am afraid this would not be very performant on large datasets. Can this be done with Window? Do you have a better/more performant suggestion?
Thanks a lot,
Andrea.
I'm trying to construct a very simple graph showing how many visits I've got in some period of time (for example, for each 5 minutes).
I have Grafana v5.4.0 paired with Postgres v9.6, full of data.
My table is below:
CREATE TABLE visit (
id serial CONSTRAINT visit_primary_key PRIMARY KEY,
user_credit_id INTEGER NOT NULL REFERENCES user_credit(id),
visit_date bigint NOT NULL,
visit_path varchar(128),
method varchar(8) NOT NULL DEFAULT 'GET'
);
Here's some data in it:
id | user_credit_id | visit_date | visit_path | method
----+----------------+---------------+---------------------------------------------+--------
1 | 1 | 1550094818029 | / | GET
2 | 1 | 1550094949537 | /mortgage/restapi/credit/{userId}/decrement | POST
3 | 1 | 1550094968651 | /mortgage/restapi/credit/{userId}/decrement | POST
4 | 1 | 1550094988557 | /mortgage/restapi/credit/{userId}/decrement | POST
5 | 1 | 1550094990820 | /index/UGiBGp0V | GET
6 | 1 | 1550094990929 | / | GET
7 | 2 | 1550095986310 | / | GET
...
So I tried these 3 variants (and actually dozens of others) with no success:
Solution A:
SELECT
visit_date as "time",
count(user_credit_id) AS "user_credit_id"
FROM visit
WHERE $__timeFilter(visit_date)
ORDER BY visit_date ASC
No data on graph. Error: pq: invalid input syntax for integer: "2019-02-14T13:16:50Z"
Solution B:
SELECT
$__unixEpochFrom(visit_date),
count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY user_credit_id
Series A
SELECT
$__time(visit_date/1000,10m,previous),
count(user_credit_id) AS "user_credit_id A"
FROM
visit
WHERE
visit_date >= $__unixEpochFrom()::bigint*1000 and
visit_date <= $__unixEpochTo()::bigint*1000
GROUP BY 1
ORDER BY 1
No data on graph. No error.
Solution C:
SELECT
$__timeGroup(visit_date, '1h'),
count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY time
No data on graph. Error: pq: function pg_catalog.date_part(unknown, bigint) does not exist
Could someone please help me sort out this simple problem? I think the query should be compact, naive and simple, but the Grafana docs demoing its syntax and features confuse me slightly. Thanks in advance!
Use this query, which will work if visit_date is timestamptz:
SELECT
$__timeGroupAlias(visit_date,5m,0),
count(*) AS "count"
FROM visit
WHERE
$__timeFilter(visit_date)
GROUP BY 1
ORDER BY 1
But your visit_date is bigint, so you need to convert it to a timestamp (probably with TO_TIMESTAMP()), or you will need to find another way to use it as bigint. Use the query inspector for debugging and you will see the SQL generated by Grafana.
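For example, assuming visit_date holds milliseconds since the epoch (as in your sample data), the conversion might look roughly like this; check the query inspector to confirm the macros expand the expression the way you expect:
SELECT
  $__timeGroupAlias(to_timestamp(visit_date/1000),5m),
  count(*) AS "count"
FROM visit
WHERE
  $__timeFilter(to_timestamp(visit_date/1000))
GROUP BY 1
ORDER BY 1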
Jan Garaj, thanks a lot! I should admit that your snippet and, even more valuably, your advice to switch to SQL debugging dramatically helped me make my "breakthrough".
So, the resulting query which solved my problem is below:
SELECT
$__unixEpochGroup(visit_date/1000, '5m') AS "time",
count(user_credit_id) AS "Total Visits"
FROM visit
WHERE
'1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval BETWEEN
$__timeFrom()::timestamp
AND
$__timeTo()::timestamp
GROUP BY 1
ORDER BY 1
Several comments to decipher all this Grafana magic:
Grafana has a limited DSL for making configurable graphs; this set of functions is converted into meaningful SQL (this is where seeing the "compiled" SQL helped me a lot, many thanks again).
To make my BIGINT column suitable for the predefined Grafana functions, we simply need to convert it to seconds from the UNIX epoch, i.e. just divide by 1000.
The WHERE clause is not so simple and predictable: the Grafana DSL works differently there, and simple division did not do the trick. I solved it by using other Grafana functions to get the FROM and TO points in time (the period for which the graph should be rendered), but these functions generate a timestamp type while our column is BIGINT. Thankfully Postgres has plenty of conversion tools: '1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval turns the BIGINT value into a Postgres TIMESTAMP, which Grafana deals with just fine.
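For example, taking the first visit_date from the sample data (1550094818029), the expression works out roughly as follows (a bare number cast to interval is read as seconds):
SELECT '1970-01-01 00:00:00 GMT'::timestamp
     + ((1550094818029/1000)::text)::interval;
-- 1550094818 seconds after the epoch, i.e. roughly 2019-02-13 21:53:38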
P.S. If you don't mind I've changed my question text to be more precise and detailed.
I have a table in Redshift like:
category | date
----------------
1 | 9/29/2016
1 | 9/28/2016
2 | 9/28/2016
2 | 9/28/2016
which I'd like to turn into:
category | 9/29/2016 | 9/28/2016
--------------------------------
1 | 1 | 1
2 | 0 | 2
(count of each category for each date)
Pivot a table with Amazon RedShift / PostgreSQL seems to be helpful using CASE statements but that requires knowing all possible cases beforehand - how could I do this if the columns I want are every day starting from a given date?
There is no functionality provided with Amazon Redshift that can automatically pivot the data.
The Pivot a table with Amazon RedShift / PostgreSQL page you referenced shows how the output can be generated, but it is unable to automatically adjust the number of columns based upon the input data.
One option would be to write a program that queries the available date ranges and then generates the SQL query. However, this can't be done entirely within Amazon Redshift.
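For reference, a generated query for the sample data above might look something like the following (my_table is a placeholder name; note that every date column still has to be spelled out, which is exactly the limitation):
SELECT
  category,
  SUM(CASE WHEN date = '2016-09-29' THEN 1 ELSE 0 END) AS "9/29/2016",
  SUM(CASE WHEN date = '2016-09-28' THEN 1 ELSE 0 END) AS "9/28/2016"
FROM my_table
GROUP BY category
ORDER BY category;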
You could do a self join on date, which I'm currently looking up how to do.
I have a table with a few million records.
___________________________________________________________
| col1 | col2 | col3 | some_indicator | last_updated_date |
-----------------------------------------------------------
| | | | yes | 2009-06-09.12.2345|
-----------------------------------------------------------
| | | | yes | 2009-07-09.11.6145|
-----------------------------------------------------------
| | | | no | 2009-06-09.12.2345|
-----------------------------------------------------------
I have to delete records which are older than a month with some_indicator=no.
I also have to delete records older than a year with some_indicator=yes. This job will run every day.
Can I use the DB2 partitioning feature for this requirement?
How can I partition the table using the last_updated_date column and the above two some_indicator values?
One partition should contain records falling under the monthly delete criterion, whereas the other should contain the yearly delete criterion records.
Are there any performance issues associated with table partitioning if this table is frequently read and upserted?
Any other best practices for this requirement would surely help.
I haven't done much with partitioning (I've mostly worked with DB2 on the iSeries), but from what I understand, you don't generally want to be shuffling things between partitions (i.e. making the partition '1 month ago'). I'm not even sure it's possible. If it were, you'd have to scan some (potentially large) portion of your table every day, just to move it (select, insert, delete, in a transaction).
Besides which, partitioning is a DB Admin problem, and it sounds like you just have a DB User problem - namely, deleting 'old' records. I'd just do this in a couple of statements:
DELETE FROM myTable
WHERE some_indicator = 'no'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 MONTH, TIME('00:00:00'))
and
DELETE FROM myTable
WHERE some_indicator = 'yes'
AND last_updated_date < TIMESTAMP(CURRENT_DATE - 1 YEAR, TIME('00:00:00'))
.... and you can pretty much ignore using a transaction, as you want the rows gone.
(as a side note, using 'yes' and 'no' for indicators is terrible. If you're not on a version that has a logical (boolean) type, store character '0' (false) and '1' (true))
Is there a simple (i.e. non-hacky) and race-condition-free way to create a partitioned sequence in PostgreSQL? Example:
Using a normal sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 3 |
| 2 | 4 |
Using a partitioned sequence in Issue:
| Project_ID | Issue |
| 1 | 1 |
| 1 | 2 |
| 2 | 1 |
| 2 | 2 |
I do not believe there is a simple way that is as easy as regular sequences, because:
A sequence stores only one number stream (next value, etc.). You want one for each partition.
Sequences have special handling that bypasses the current transaction (to avoid the race condition). It is hard to replicate this at the SQL or PL/pgSQL level without using tricks like dblink.
The DEFAULT column property can use a simple expression or a function call like nextval('myseq'); but it cannot refer to other columns to inform the function which stream the value should come from.
You can make something that works, but you probably won't think it simple. Addressing the above problems in turn:
Use a table to store the next value for all partitions, with a schema like multiseq (partition_id, next_val).
Write a multinextval(seq_table, partition_id) function that does something like the following:
Create a new transaction independent of the current transaction (one way of doing this is through dblink; I believe some other server languages can do it more easily).
Lock the table mentioned in seq_table.
Update the row where the partition id is partition_id, with an incremented value. (Or insert a new row with value 2 if there is no existing one.)
Commit that transaction and return the previous stored id (or 1).
Create an insert trigger on your Issue table that uses a call to multinextval('multiseq', NEW.Project_ID) to populate the Issue column on insertions.
I have not used this entire plan myself, but I have tried something similar to each step individually. Examples of the multinextval function and the trigger can be provided if you want to attempt this...
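For instance, a rough single-transaction sketch might look like the following (the names multiseq, multinextval and issue are just the ones used above; it skips the dblink step, so the lock taken here is held until the inserting transaction commits, serialising concurrent inserts):
CREATE TABLE multiseq (
    partition_id integer PRIMARY KEY,
    next_val     bigint  NOT NULL
);

CREATE OR REPLACE FUNCTION multinextval(seq_table regclass, p_partition_id integer)
RETURNS bigint AS $$
DECLARE
    v bigint;
BEGIN
    -- Step 2: lock the counter table so concurrent callers serialise.
    EXECUTE format('LOCK TABLE %s IN EXCLUSIVE MODE', seq_table);
    -- Step 3: bump the stored value for this partition...
    EXECUTE format(
        'UPDATE %s SET next_val = next_val + 1
         WHERE partition_id = $1 RETURNING next_val - 1', seq_table)
        INTO v USING p_partition_id;
    -- ...or create the row with value 2 if this partition is new.
    IF v IS NULL THEN
        EXECUTE format(
            'INSERT INTO %s (partition_id, next_val) VALUES ($1, 2)', seq_table)
            USING p_partition_id;
        v := 1;
    END IF;
    -- Step 4: hand back the value reserved for this caller.
    RETURN v;
END;
$$ LANGUAGE plpgsql;

-- Fill in the per-project Issue number on insert.
CREATE OR REPLACE FUNCTION issue_multiseq_trigger() RETURNS trigger AS $$
BEGIN
    NEW.issue := multinextval('multiseq', NEW.project_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER issue_number
    BEFORE INSERT ON issue
    FOR EACH ROW EXECUTE PROCEDURE issue_multiseq_trigger();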