Push query to ksqlDB not returning final result in first result row - confluent-platform

I'm trying to get the count of events in a ksqlDB table within an arbitrary time window.
The table my_table was created with a WINDOW SESSION.
It is important to note that the query is run after all the data has been processed, and the ksqlDB server is basically idle.
My query looks something like this:
SELECT COUNT(*) AS count
FROM my_table
WHERE WINDOWSTART < (1602010972370 + 5000) AND WINDOWEND > 1602010972370
GROUP BY 1 EMIT CHANGES;
Running this kind of query very often returns one result row, followed immediately by a second row with the actual "final" result.
It doesn't look like the values in the table are still "settling", because if I repeat the same query (as many times as I want) I get exactly the same behavior.
I'm assuming there is some configuration value that will make ksqlDB wait just a little longer (on the order of one second) before returning the result, so that I could get the final result in the first row?
By the way, using EMIT FINAL will not work on the query itself, since it only applies to "windowed queries".

Related

grouping multiple queries into a single one, with Postgres

I have a very simple query:
SELECT * FROM someTable
WHERE instrument = '{instrument}' AND ts >= '{fromTime}' AND ts < '{toTime}'
ORDER BY ts
That query is applied to 3 tables across 2 databases.
I receive a list of rows that have timestamps (ts). I take the last timestamp and it serves as the basis for the 'fromTime' of the next iteration. toTime is usually equal to 'now'.
This allows me to only get new rows at every iteration.
I have about 30 instrument types and I need an update every 1s.
So that's 30 instruments * 3 queries = 90 queries per second.
How can I rewrite the query so that I could use a function like this:
getData table [(instrument, fromTime) list] toTime
and get back some dictionary, in the form:
Dictionary<instrument, MyDataType list>
To use a list of instruments, I could do something like:
WHERE instrument in '{instruments list}'
but this wouldn't help with the various fromTime as there is one value per instrument.
I could take the min of all fromTime values, get the data for all instruments and then filter the results, but that's wasteful since I could potentially query a lot of data only to throw it away right after.
What is the right strategy for this?
So there is a single toTime to test against per query, but a different fromTime per instrument.
One solution to group them in a single query would be to pass a list of (instrument, fromTime) couples as a relation.
The query would look like this:
SELECT [columns] FROM someTable
JOIN (VALUES
('{name of instrument1}', '{fromTime for instrument1}'),
('{name of instrument2}', '{fromTime for instrument2}'),
('{name of instrument3}', '{fromTime for instrument3}'),
...
) AS params(instrument, fromTime)
ON someTable.instrument = params.instrument AND someTable.ts >= params.fromTime
WHERE ts < '{toTime}';
Depending on your datatypes and what method is used by the client-side driver
to pass parameters, you may have to be explicit about the datatype of
your parameters by casting the first value of the list, as in, for
example:
JOIN (VALUES
('name of instrument1', '{fromTime for instrument1}'::timestamptz),
If you had many more than 30 values, a variant of this query with arrays as parameters (instead of the VALUES clause) could be preferable. The difference is that it would take 3 parameters: 2 arrays + 1 upper bound, instead of N*2+1 parameters. But it depends on the ability of the client-side driver to support Postgres arrays as a datatype, and to pass them as a single value.
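The VALUES-join pattern above can be demonstrated end to end. The sketch below uses sqlite3 rather than Postgres (so the pairs go in a CTE, since SQLite lacks the `AS alias(col, ...)` syntax), and the table contents and timestamps are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE someTable (instrument TEXT, ts TEXT, price REAL);
INSERT INTO someTable VALUES
    ('EURUSD', '2023-01-01 10:00:01', 1.07),
    ('EURUSD', '2023-01-01 10:00:05', 1.08),
    ('GBPUSD', '2023-01-01 10:00:02', 1.21),
    ('GBPUSD', '2023-01-01 10:00:06', 1.22);
""")

# One (instrument, fromTime) pair per instrument, a single shared toTime.
rows = conn.execute("""
WITH params(instrument, fromTime) AS (
    VALUES ('EURUSD', '2023-01-01 10:00:03'),
           ('GBPUSD', '2023-01-01 10:00:00')
)
SELECT s.instrument, s.ts, s.price
FROM someTable s
JOIN params ON s.instrument = params.instrument
           AND s.ts >= params.fromTime
WHERE s.ts < '2023-01-01 11:00:00'
ORDER BY s.instrument, s.ts
""").fetchall()
```

Each instrument is filtered by its own lower bound in a single round trip: the early EURUSD row is excluded by its later fromTime while both GBPUSD rows pass.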

Postgres Query Shows one Item twice with offset and limit set

I have this query written in Postgresql.
SELECT "api_issue".id
FROM "api_issue"
LEFT JOIN api_issue_categories ON api_issue_categories.issue_id = api_issue.id
AND api_issue_categories.categories_id = '1126'
WHERE api_issue_categories.categories_id = '1126'
ORDER BY api_issue.published_date LIMIT '20' OFFSET '40'
This query returns the following.
ID
313279
312740
.....
313953
The key here is ID 313953
Now I adjust the queries offset to 60.
SELECT "api_issue".id
FROM "api_issue"
LEFT JOIN api_issue_categories ON api_issue_categories.issue_id = api_issue.id
AND api_issue_categories.categories_id = '1126'
WHERE api_issue_categories.categories_id = '1126'
ORDER BY api_issue.published_date LIMIT '20' OFFSET '60'
And the following results are returned.
ID
313953
.....
312740
313454
Notice that 313953 is returned as the first result.
So the problem is that ID 313953 is returned as the last result in the initial query and the first result in the second query. I've verified that there is only 1 entry for this record in the JOIN table.
The extremely strange thing is that you would expect this to happen consistently, i.e. that the last returned ID would always be the first returned ID of the next query, but it only happens when the initial offset is 40 and the second query uses an offset of 60.
This query is used on the front end as a paging result and this is the only entry out of 175 that shows up twice for some reason.
Does anyone have any idea?? I'm baffled.
There are two ways this can happen.
Either api_issue.id is not unique in the result. This can happen if the issue is linked to the same category multiple times.
More likely, though, there are multiple issues with the same publish date, in which case there is no guarantee on how they are ordered. Adding a secondary sort key will give you a stable order, e.g. ORDER BY api_issue.published_date, api_issue.id
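The fix can be checked with a small sqlite3 sketch (the table and dates below are made up): every row shares the same published_date, so only the secondary key makes paging deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE api_issue (id INTEGER PRIMARY KEY, published_date TEXT)")
# Every row shares one published_date, so that column alone is not a total order.
conn.executemany("INSERT INTO api_issue VALUES (?, ?)",
                 [(i, "2020-01-01") for i in range(1, 101)])

def page(offset, limit=20):
    # The secondary sort key (id) breaks ties, so consecutive pages
    # never overlap or skip rows.
    return [r[0] for r in conn.execute(
        "SELECT id FROM api_issue ORDER BY published_date, id LIMIT ? OFFSET ?",
        (limit, offset))]

pages = [page(o) for o in (0, 20, 40, 60, 80)]
all_ids = [i for p in pages for i in p]
```

With the tiebreaker in place, the five pages partition the 100 rows exactly: no id appears twice and none is skipped.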

Converting a complex query with an inner join to Tableau

I have a query like this, which we use to generate data for our custom dashboard (a Rails app):
SELECT AVG(wait_time) FROM (
SELECT TIMESTAMPDIFF(MINUTE,a.finished_time,b.start_time) wait_time
FROM (
SELECT max(start_time + INTERVAL avg_time_spent SECOND) finished_time, branch
FROM mytable
WHERE name IN ('test_name')
AND status = 'SUCCESS'
GROUP by branch) a
INNER JOIN
(
SELECT MIN(start_time) start_time, branch
FROM mytable
WHERE name IN ('test_name_specific')
GROUP by branch) b
ON a.branch = b.branch
HAVING avg_time_spent between 0 and 1000)t
GROUP BY week
Now I am trying to port this to Tableau, and I cannot find a way to represent this data in Tableau. I am stuck on how to represent the inner GROUP BY in a calculated field. I could also just use a custom SQL data source, but I am already using another data source.
columns in mytable -
start_time
avg_time_spent
name
branch
status
I think this could be achieved with the new Level of Detail formulas, but unfortunately I am stuck on version 8.3.
Save custom SQL for rare cases. This doesn't look like a rare case. Let Tableau generate the SQL for you.
If you simply connect to your table, you can usually write calculated fields to get the information you want. I'm not exactly sure why you have test_name in one part of your query but test_name_specific in another, so ignoring that, here is a simplified approach to a similar query.
If you define a calculated field called worst_case_test_time as
DATEDIFF('minute', MIN(start_time), MAX(DATEADD('second', avg_time_spent, start_time)))
that seems close to what your original query computes.
It would help if you explained what exactly you are trying to compute. It appears to be some sort of worst-case bound for average test time. There may be an even simpler formula, but it's hard to know without a little context.
You could filter on status = "Success" and avg_time_spent < 1000, and place branch and WEEK(start_time) on say the row and column shelves.
P.S. Your query seems a little off. Don't you need an aggregation function like MAX or AVG after the HAVING keyword?
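For reference, and ignoring the week grouping, the wait-time logic of the original query can be sketched in plain Python (the rows, branch name, and durations below are made up):

```python
from datetime import datetime, timedelta

# (name, status, branch, start_time, avg_time_spent in seconds)
rows = [
    ("test_name",          "SUCCESS", "main", datetime(2023, 1, 2, 10, 0), 600),
    ("test_name",          "SUCCESS", "main", datetime(2023, 1, 2, 11, 0), 300),
    ("test_name_specific", "SUCCESS", "main", datetime(2023, 1, 2, 12, 0), 0),
]

finished = {}  # branch -> MAX(start_time + avg_time_spent), i.e. subquery a
started = {}   # branch -> MIN(start_time) of the follow-up job, i.e. subquery b
for name, status, branch, start, spent in rows:
    if name == "test_name" and status == "SUCCESS":
        end = start + timedelta(seconds=spent)
        finished[branch] = max(finished.get(branch, end), end)
    elif name == "test_name_specific":
        started[branch] = min(started.get(branch, start), start)

# Inner join on branch, then the TIMESTAMPDIFF in minutes and the outer AVG.
waits = [(started[b] - finished[b]).total_seconds() / 60
         for b in finished if b in started]
avg_wait = sum(waits) / len(waits)
```

With the sample rows, the 'main' branch finishes its last test at 11:05 and the follow-up job starts at 12:00, so the average wait is 55 minutes.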

SQL Select rows by comparison of value to aggregated function result

I have a table listing (gameid, playerid, team, max_minions) and I want to get the players within each team that have the lowest max_minions (within each team, within each game). I.e. I want a list (gameid, team, playerid_with_lowest_minions) for each game/team combination.
I tried this:
SELECT * FROM MinionView GROUP BY gameid, team
HAVING MIN(max_minions) = max_minions;
Unfortunately, this doesn't work: it seems to select a random row from the available rows for each (gameid, team) and then apply the HAVING comparison. If the randomly selected row doesn't match, it's simply skipped.
Using WHERE won't work either since you can't use aggregate functions within WHERE clauses.
LIMIT won't work since I have many more games and LIMIT limits the total number of rows returned.
Is there any way to do this without adding another table/view that contains (gameid, teamid, MIN(max_minions))?
Example data:
sqlite> SELECT * FROM MinionView;
gameid|playerid|team|champion|max_minions
21|49|100|Champ1|124
21|52|100|Champ2|18
21|53|100|Champ3|303
21|54|200|Champ4|356
21|57|200|Champ5|180
21|58|200|Champ6|21
64|49|100|Champ7|111
64|50|100|Champ8|208
64|53|100|Champ9|8
64|54|200|Champ0|226
64|55|200|ChampA|182
64|58|200|ChampB|15
...
Expected result (I mostly care about playerid, but included champion, max_minions here for better overview):
21|52|100|Champ2|18
21|58|200|Champ6|21
64|53|100|Champ9|8
64|58|200|ChampB|15
...
I'm using Sqlite3 under Python 3.1 if that matters.
This is in SQL Server, hopefully the syntax works for you too:
SELECT MV.*
FROM (
    SELECT team, gameid, MIN(max_minions) AS maxmin
    FROM MinionView
    GROUP BY team, gameid
) groups
JOIN MinionView MV
    ON MV.team = groups.team
    AND MV.gameid = groups.gameid
    AND MV.max_minions = groups.maxmin
In words, first you make the usual grouping query (the nested one). At this point you have the min value for each group but you don't know to which row it belongs. For this you join with the original table and match the "keys" (team, game and min) to get the other columns as well.
Note that if a team has more than one member with the same max_minions value, all of those rows will be selected. If you want only one of them, that's probably a bit more complicated.
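Since the question mentions sqlite3 under Python, here is the same grouped-subquery join run end to end on the example data (the subquery alias is shortened to mins):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MinionView "
             "(gameid INT, playerid INT, team INT, champion TEXT, max_minions INT)")
conn.executemany("INSERT INTO MinionView VALUES (?, ?, ?, ?, ?)", [
    (21, 49, 100, 'Champ1', 124), (21, 52, 100, 'Champ2', 18),
    (21, 53, 100, 'Champ3', 303), (21, 54, 200, 'Champ4', 356),
    (21, 57, 200, 'Champ5', 180), (21, 58, 200, 'Champ6', 21),
    (64, 49, 100, 'Champ7', 111), (64, 50, 100, 'Champ8', 208),
    (64, 53, 100, 'Champ9', 8),   (64, 54, 200, 'Champ0', 226),
    (64, 55, 200, 'ChampA', 182), (64, 58, 200, 'ChampB', 15),
])

# Find each group's minimum first, then join back to recover the full rows.
rows = conn.execute("""
SELECT MV.*
FROM (
    SELECT team, gameid, MIN(max_minions) AS maxmin
    FROM MinionView
    GROUP BY team, gameid
) mins
JOIN MinionView MV
    ON MV.team = mins.team
    AND MV.gameid = mins.gameid
    AND MV.max_minions = mins.maxmin
ORDER BY MV.gameid, MV.team
""").fetchall()
```

The result matches the expected output in the question: one lowest-max_minions player per (gameid, team) pair.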

Does DataReader.NextResult always retrieve the results in the same order

I have a SELECT query that yields multiple rows and does not have an ORDER BY clause.
If I execute this query multiple times and then iterate through results using DataReader.NextResult(), would I be guaranteed to get the results in the same order?
For example, if I execute the following query, which returns 199 rows:
SELECT * FROM products WHERE productid < 200
would I always get the first result with productid = 1 and so on?
As far as I have observed, it always returns the results in the same order, but I cannot find any documentation of this behavior.
======================================
As per my research:
Check out this blog: Conor vs. SQL. I actually wanted to ask whether the query result can change even if the data in the table remains the same (i.e., no updates or deletes). It seems that for a large table, when SQL Server employs parallelism, the order can be different.
First of all, to iterate the rows in a DataReader, you should call Read, not NextResult.
Calling NextResult will move to the next result set if your query has multiple SELECT statements.
To answer your question, you must not rely on this.
A query without an ORDER BY clause will return rows in SQL Server's default iteration order.
For small tables, this will usually be the order in which the rows were added, but this is not guaranteed and is liable to change at any time. For example, if the table is indexed or partitioned, the order may be different.
No, DataReader will return the results in the order they come back from SQL. If you don't specify an ORDER BY clause, that will be the order that they exist in the table.
It is possible, perhaps even likely, that they will always return in the same order, but this isn't guaranteed. The order is determined by the query plan (at least in SQL Server) on the database server. If something changes that query plan, the order could change. You should always use ORDER BY if the order of results is in any way important to your processing of the data.