KSQL create table with multi-column aggregation - apache-kafka

So essentially I'd like to group by two columns, like this:
CREATE TABLE foo AS
SELECT a, b, SUM(a)
FROM whatever
GROUP BY a, b
whatever is a stream in Kafka format. When I issue the command, ksql returns:
Key format does not support schema.
format: KAFKA
schema: Persistence{columns=[`a` STRING KEY, `b` STRING KEY], features=[]}
reason: The 'KAFKA' format only supports a single field. Got: [`a` STRING KEY, `b` STRING KEY]
Caused by: The 'KAFKA' format only supports a single field. Got: [`a` STRING KEY, `b` STRING KEY]
The problem is that the KAFKA format does not support multi-column keys. Is there a way to work around this, e.g. by creating an artificial key in this table? I did not manage to do this.
I saw someone post a similar question, and the answer seemed to work; I suspect that's because of the format used there. https://stackoverflow.com/a/50193239/9632703
The documentation mentions that multi-column aggregations might not work, while also saying that ksqlDB works around this in the background. Unfortunately ksql only returns the given error message. https://www.confluent.de/blog/ksqldb-0-10-updates-key-columns/#multi-column-aggregations
The funny part is that the query works if I omit the first line, CREATE TABLE foo AS. So if some data comes in, the aggregation works. But that is not persistent, of course. If nothing else works, I would also be fine with a table without a primary key defined, if that is possible in ksql, since I could still identify the data by {a, b} in my application.
Can someone help me? Thank you.

You can do this if you upgrade to ksqlDB 0.15. This version introduced support for multiple key columns. You'll need to use a KEY_FORMAT that supports it.
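If your default key format is still KAFKA, one option (a sketch; JSON is just one example of a key format that supports multiple fields) is to set the format inline in the statement:
CREATE TABLE FOO
  WITH (KEY_FORMAT='JSON') AS
  SELECT A, B, SUM(C)
  FROM TEST_STREAM
  GROUP BY A, B;
With a suitable key format in place, the session looks like this: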
ksql> CREATE TABLE FOO AS SELECT A, B, SUM(C) FROM TEST_STREAM GROUP BY A,B;
Message
-----------------------------------
Created query with ID CTAS_FOO_53
-----------------------------------
ksql> DESCRIBE FOO;
Name : FOO
Field | Type
---------------------------------------------
A | BIGINT (primary key)
B | VARCHAR(STRING) (primary key)
KSQL_COL_0 | DOUBLE
---------------------------------------------
For runtime statistics and query details run: DESCRIBE EXTENDED <Stream,Table>;
ksql> SELECT * FROM FOO EMIT CHANGES LIMIT 5;
+---------------------------+---------------------------+---------------------------+
|A |B |KSQL_COL_0 |
+---------------------------+---------------------------+---------------------------+
|220071000 |AIS |0.4 |
|257838000 |AIS |6.2 |
|538007854 |AIS |22.700000000000003 |
|257487000 |AIS |2.4 |
|257601800 |AIS |5.8999999999999995 |
Limit Reached
Query terminated

Related

psql foreign key and join tables by it

I have 3 tables:
pet:
id | ...
vaccination:
id | petId | ...
visit:
id | petId | ...
In the vaccination and visit tables, petId is a foreign key.
When I try to do a join, I get an error:
select
COALESCE(visit.petId, vaccination.petId) as petId,
GREATEST(visit.date, vaccination.date) as date
from visit
FULL JOIN vaccination on vaccination.petId = visit.petId
Or
(select petId from vaccination)
union all
(select petId from visit)
Error is:
column "petid" doesn't exist
Hint: Possibly a column reference was intended on "vaccination.petId"
I understand that it is some kind of 'virtual' column generated by the foreign key.
But I don't know how I can join on it.
Thanks for help!
P.S. These tables were generated by TypeORM.
Column, schema, and table names are case sensitive, and when you don't use quote marks, PostgreSQL folds what you typed to lower case.
So use "petId", visit."petId" and vaccination."petId" to prevent unexpected lowercasing.
You can even decide to always put column, schema, and table names in quote marks; then you won't get caught out by this. For example: "vaccination"."petId"
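For example, the full join from the question rewritten with quoted identifiers (a sketch; it assumes the date columns were created in lower case):
select
    COALESCE(visit."petId", vaccination."petId") as "petId",
    GREATEST(visit.date, vaccination.date) as date
from visit
FULL JOIN vaccination on vaccination."petId" = visit."petId"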

Postgres Full Text Search - Find Other Similar Documents

I am looking for a way to use Postgres (version 9.6+) Full Text Search to find other documents similar to an input document, essentially looking for a way to produce results similar to Elasticsearch's more_like_this query. As far as I can tell, Postgres offers no way to compare ts_vectors to each other.
I've tried various techniques, like converting the source document back into a ts_query or reprocessing the original doc, but that requires too much overhead.
Would greatly appreciate any advice - thanks!
Looks like the only option is to use pg_trgm instead of the Postgres built-in full text search. Here is how I ended up implementing this:
I used a simple table (a materialized view in this case); it holds the primary key of the post and the full text body in two columns.
Materialized view "public.text_index"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+----------+--------------+-------------
id | integer | | | | plain | |
text | text | | | | extended | |
View definition:
SELECT posts.id,
posts.body AS text
FROM posts
ORDER BY posts.publication_date DESC;
Then using a lateral join we can match rows and order them by similarity to find posts that are "close to" or "related" to any other post:
select * from text_index tx
left join lateral (
select similarity(tx.text, t.text) from text_index t where t.id = 12345
) s on true
order by similarity desc
limit 10;
This of course is a naive way to match documents and may require further tuning. Additionally, a trgm GIN index on the text column will speed up the searches significantly.
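For reference, a sketch of creating such an index (the index name is made up; the pg_trgm extension must be installed first):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
-- GIN index with trigram operators on the text column of the materialized view
CREATE INDEX text_index_trgm_idx ON text_index USING gin (text gin_trgm_ops);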

PostgreSQL UPDATE doesn't seem to update some rows

I am trying to update a table from another table, but a few rows simply don't update, while the other million rows work just fine.
The statement I am using is as follows:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
It says 647 rows were updated, but I can't see the change.
I've also tried without the IS NULL clause; the results are the same.
If I do a join it seems to work as expected, the join query I used is this one:
SELECT sql, l.quali_ambiental, c.quali_ambiental FROM lotes_infos l
JOIN sirgas_lotes_centroid c
USING (sql)
WHERE l.quali_ambiental IS NULL;
It returns 787 rows (some are both NULL, that's OK). This is a sample from the result of the join:
sql | quali_ambiental | quali_ambiental
------------+-----------------+-----------------
1880040001 | | PA 10
1880040001 | | PA 10
0863690003 | | PA 4
0850840001 | | PA 4
3090500003 | | PA 4
1330090001 | | PA 10
1201410001 | | PA 9
0550620002 | | PA 6
0430790001 | | PA 1
1340180002 | | PA 9
I used QGIS to visualize the results, and could not find any tips to why it is happening. The sirgas_lotes_centroid comes from the other table, the geometry being the centroid for the polygon. I used the centroid to perform faster spatial joins and now need to place the information into the table with the original polygon.
The sql column is type text, quali_ambiental is varchar(6) for both.
If I directly update one row using the following query, it works just fine:
UPDATE lotes_infos
SET quali_ambiental = 'PA 1'
WHERE sql LIKE '0040510001';
If you don't see results of a seemingly sound data-modifying query, the first question to ask is:
Did you commit your transaction?
Many clients work with auto-commit by default, but some do not. And even in the standard client psql you can start an explicit transaction with BEGIN (or syntax variants) to disable auto-commit. Then results are not visible to other transactions before the transaction is actually committed with COMMIT. It might hang indefinitely (which creates additional problems), or be rolled back by some later interaction.
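In psql, for example (a sketch reusing the query from the question; once you open an explicit transaction, nothing is visible to other sessions until you commit):
BEGIN;
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql AND l.quali_ambiental IS NULL;
COMMIT;  -- without this, other sessions never see the change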
That said, you mention: some are both null, that's ok. You'll want to avoid costly empty updates with something like:
UPDATE lotes_infos l
SET quali_ambiental = s.quali_ambiental
FROM sirgas_lotes_centroid s
WHERE l.sql = s.sql
AND l.quali_ambiental IS NULL
AND s.quali_ambiental IS NOT NULL; --!
Related:
How do I (or can I) SELECT DISTINCT on multiple columns?
The duplicate 1880040001 in your sample can have two explanations. Either lotes_infos.sql is not UNIQUE (even after filtering with l.quali_ambiental IS NULL). Or sirgas_lotes_centroid.sql is not UNIQUE. Or both.
If it's just lotes_infos.sql, your query should still work. But duplicates in sirgas_lotes_centroid.sql make the query non-deterministic (as #jjanes also pointed out). A target row in lotes_infos can have multiple candidates in sirgas_lotes_centroid. The outcome is arbitrary for lack of definition. If one of them has quali_ambiental IS NULL, it can explain what you observed.
My suggested query fixes the observed problem superficially, in that it excludes NULL values in the source table. But if there can be more than one non-null, distinct quali_ambiental for the same sirgas_lotes_centroid.sql, your query remains broken, as the result is arbitrary. You'll have to define which source row to pick and translate that into SQL.
Here is one example of how to do that (chapter "Multiple matches..."):
Updating the value of a column
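A sketch of one possible definition, using DISTINCT ON to pick the smallest non-null quali_ambiental per sql (the pick order is an assumption; substitute your own rule):
UPDATE lotes_infos l
SET    quali_ambiental = s.quali_ambiental
FROM  (
   SELECT DISTINCT ON (sql)
          sql, quali_ambiental
   FROM   sirgas_lotes_centroid
   WHERE  quali_ambiental IS NOT NULL
   ORDER  BY sql, quali_ambiental  -- defines which row wins per sql
   ) s
WHERE  l.sql = s.sql
AND    l.quali_ambiental IS NULL;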
Always include exact table definitions (CREATE TABLE statements) with any such question. This would save a lot of time wasted on speculation.
Aside: Why are the sql columns type text? Values like 1880040001 strike me as integer or bigint. If so, text is a costly design error.

Graph in Grafana using Postgres Datasource with BIGINT column as time

I'm trying to construct a very simple graph showing how many visits I've got in some period of time (for example, for each 5 minutes).
I have Grafana v5.4.0 paired with Postgres v9.6, full of data.
My table below:
CREATE TABLE visit (
id serial CONSTRAINT visit_primary_key PRIMARY KEY,
user_credit_id INTEGER NOT NULL REFERENCES user_credit(id),
visit_date bigint NOT NULL,
visit_path varchar(128),
method varchar(8) NOT NULL DEFAULT 'GET'
);
Here's some data in it:
id | user_credit_id | visit_date | visit_path | method
----+----------------+---------------+---------------------------------------------+--------
1 | 1 | 1550094818029 | / | GET
2 | 1 | 1550094949537 | /mortgage/restapi/credit/{userId}/decrement | POST
3 | 1 | 1550094968651 | /mortgage/restapi/credit/{userId}/decrement | POST
4 | 1 | 1550094988557 | /mortgage/restapi/credit/{userId}/decrement | POST
5 | 1 | 1550094990820 | /index/UGiBGp0V | GET
6 | 1 | 1550094990929 | / | GET
7 | 2 | 1550095986310 | / | GET
...
So I tried these three variants (and, actually, dozens of others), with no success:
Solution A:
SELECT
visit_date as "time",
count(user_credit_id) AS "user_credit_id"
FROM visit
WHERE $__timeFilter(visit_date)
ORDER BY visit_date ASC
No data on graph. Error: pq: invalid input syntax for integer: "2019-02-14T13:16:50Z"
Solution B
SELECT
$__unixEpochFrom(visit_date),
count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY user_credit_id
Series A:
SELECT
$__time(visit_date/1000,10m,previous),
count(user_credit_id) AS "user_credit_id A"
FROM
visit
WHERE
visit_date >= $__unixEpochFrom()::bigint*1000 and
visit_date <= $__unixEpochTo()::bigint*1000
GROUP BY 1
ORDER BY 1
No data on graph. No error.
Solution C:
SELECT
$__timeGroup(visit_date, '1h'),
count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY time
No data on graph. Error: pq: function pg_catalog.date_part(unknown, bigint) does not exist
Could someone please help me sort out this simple problem? I think the query should be compact, naive, and simple, but the Grafana docs demoing its syntax and features confuse me slightly. Thanks in advance!
Use this query, which would work if visit_date were timestamptz:
SELECT
$__timeGroupAlias(visit_date,5m,0),
count(*) AS "count"
FROM visit
WHERE
$__timeFilter(visit_date)
GROUP BY 1
ORDER BY 1
But your visit_date is bigint, so you need to convert it to a timestamp (probably with TO_TIMESTAMP()), or find some other way to use it as bigint. Use the query inspector for debugging and you will see the SQL generated by Grafana.
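For example, a sketch of the TO_TIMESTAMP() route (it assumes visit_date holds epoch milliseconds, hence the division by 1000; Grafana macros expand textually, so an expression in place of a bare column can work, depending on the Grafana version):
SELECT
  $__timeGroupAlias(to_timestamp(visit_date/1000),5m),
  count(*) AS "count"
FROM visit
WHERE
  $__timeFilter(to_timestamp(visit_date/1000))
GROUP BY 1
ORDER BY 1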
Jan Garaj, thanks a lot! I should admit that your snippet and, what's more valuable, your advice to switch to SQL debugging dramatically helped me make my "breakthrough".
So, the resulting query which solved my problem is below:
SELECT
$__unixEpochGroup(visit_date/1000, '5m') AS "time",
count(user_credit_id) AS "Total Visits"
FROM visit
WHERE
'1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval BETWEEN
$__timeFrom()::timestamp
AND
$__timeTo()::timestamp
GROUP BY 1
ORDER BY 1
Several comments to decipher all this Grafana magic:
Grafana has its own limited DSL for making configurable graphs; this set of functions compiles into meaningful SQL (this is where seeing the "compiled" SQL helped me a lot, many thanks again).
To make my BIGINT column suitable for the predefined Grafana functions, we simply need to convert it to seconds from the UNIX epoch, so, in math language, just divide by 1000.
Now, the WHERE statement is not so simple and predictable; the Grafana DSL works differently there, and simple division did not do the trick. I solved it by using other Grafana functions to get the FROM and TO points in time (the period for which the graph should be rendered), but these functions generate the timestamp type while we have BIGINT in our column. So, thanks to Postgres, we have a bunch of conversion means to make it a timestamp: '1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval converts one BIGINT value to a Postgres TIMESTAMP, which Grafana handles just fine.
P.S. If you don't mind, I've changed my question text to be more precise and detailed.

Publishing table back to tickerplant

I am trying to publish a table straight from a real-time engine. Basically, I have a real-time engine that connects to the tickerplant, subscribes to a raw version of a table, and adds some new columns. Now I want this enhanced version of the table to be pushed back to the tickerplant. I have a pub function which pushes the table in the following way:
neg[handle](`.u.upd;`tablename;tabledata)
The problem is that I get a type error. I looked at the schemas of the tables and they are slightly different.
meta table1
c | t f a
----------------| -----
time | p
sym | s
col1 | c
col2 | s
col3 | i
meta table2
c | t f a
----------------| -----
time | p
sym | s
col1 | C
col2 | s
col3 | i
That capital C is most likely the problem. However, I cannot load the schema in the tickerplant with capital letters. Any idea how I should go about this?
You can define the schema with a generic list type and it will take its type from the first insert.
tab:([] col1:`int$();generic:();col3:`$())
Another issue is that your tickerplant might be expecting a list (of lists) to be sent to its .u.upd rather than the table you may be sending, so you may want to value flip your table before sending it. (And note that the tickerplant will try to prepend a timestamp if the first column isn't already a timestamp.)
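A small sketch of what value flip does (the table and data are made up):
q)t:([] sym:`a`b; col1:("xx";"yy"))
q)value flip t
`a`b
("xx";"yy")
flip turns the table into a column dictionary, and value drops the keys, leaving the plain list of column lists that .u.upd expects.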
The capital C in your meta is the result of the incoming data being nested. To resolve this, you should declare the schema with an untyped empty list.
table2:([] time:`timestamp$();sym:`$();col1:();col2:`$();col3:"I"$())
Consequently, until a record is inserted, its meta is:
q)meta table2
c | t f a
----| -----
time| p
sym | s
col1|
col2| s
col3| i
This will then be updated to match the first entry into the table.
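For example (a sketch with made-up values; the char vector going into col1 resolves that column to type C):
q)`table2 insert (.z.p;`instr1;"some text";`ref1;42i)
,0
q)meta table2
c   | t f a
----| -----
time| p
sym | s
col1| C
col2| s
col3| i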
Also, .u.upd requires the input to be a list of lists rather than a table; this can be resolved using:
neg[handle](`.u.upd;`tablename;value flip tabledata)