Trying to find the time difference between 2 events in the same table - PostgreSQL

I have a PostgreSQL table which has timestamps of various events in a chat.
The table stores the following:
Timestamp | chat_id | event_type_id | user_id
What I am trying to accomplish is to see the difference in time between event_type_id 1 and event_type_id 3 for each chat.
Or alternatively, the average of that time difference per user_id.
So:
chat_id | time_difference(sec)
15 | 240
16 | 75
and so on, or:
user_id | avg_time
145 | 540
190 | 25
My SQL knowledge is quite basic, and I just can't seem to figure out how to do this correctly.
Any SQL wizards out there that could help me out?
I am doing this in Metabase if that makes any difference.
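One approach commonly used for this kind of pairing is conditional aggregation: collapse each chat to one row and subtract the type-1 timestamp from the type-3 timestamp. A minimal, untested sketch, assuming the table is called chat_events (a placeholder name), the timestamp column is called "timestamp" as in the question, and each chat has at most one event of each type:

SELECT
    chat_id,
    EXTRACT(EPOCH FROM (
        MAX(CASE WHEN event_type_id = 3 THEN "timestamp" END)
      - MIN(CASE WHEN event_type_id = 1 THEN "timestamp" END)
    )) AS time_difference_sec
FROM chat_events
GROUP BY chat_id;

The per-user average can then be built on top of the per-chat differences (this assumes the same user_id is attached to both events of a chat):

SELECT
    user_id,
    AVG(diff_sec) AS avg_time
FROM (
    SELECT
        user_id,
        chat_id,
        EXTRACT(EPOCH FROM (
            MAX(CASE WHEN event_type_id = 3 THEN "timestamp" END)
          - MIN(CASE WHEN event_type_id = 1 THEN "timestamp" END)
        )) AS diff_sec
    FROM chat_events
    GROUP BY user_id, chat_id
) per_chat
GROUP BY user_id;

Both queries are plain SQL, so they should paste into a Metabase native question as-is.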

Related

Redshift latency, i.e. discrepancy between "execution time" and "total runtime"

I'm currently experimenting with Redshift and I've noticed that for a simple query like:
SELECT COUNT(*) FROM table WHERE column = 'value';
The execution time reported by Redshift is only 84ms, which is expected and pretty good with the table at ~33M rows. However, the total runtime, both observed on my local psql client as well as in Redshift's console UI is 5 seconds. I've tried the query on both a single node cluster and multi-node (2 nodes and 4 nodes) clusters.
In addition, when I try with more realistic, complicated queries, I can see similarly that the query execution itself is only ~500ms in a lot of cases, but the total runtime is ~7 seconds.
What causes this discrepancy? Is there any way to reduce this latency? Is there an internal table to dive deeper into the time distribution that covers the entire end-to-end runtime?
I read about the cold query performance improvements that Amazon recently introduced, but this latency seems to be there even on queries past the first cold one, as long as I alter the value in my WHERE clause. The latency is somewhat inconsistent, but it definitely still goes all the way up to 5 seconds.
-- Edited to give more details based on Bill Weiner's answer below --
There is no difference between doing SELECT COUNT(*) vs SELECT COUNT(column) (where column is a dist key to avoid skew).
There are absolutely zero other activities happening on the cluster because this is for exploration only. I'm the only one issuing queries and making connections to the DB, so there should be no queueing or locking delays.
The data resides in the Redshift database, with a normal schema and common-sense dist key and sort key. I have not added explicit compression to any columns, so everything is just AUTO right now.
Looks like compile time is the culprit!
STL_WLM_QUERY shows that for query 12599, this is the exec_start_time/exec_end_time:
-[ RECORD 1 ]------------+-----------------------------------------------------------------
userid | 100
xid | 14812605
task | 7289
query | 12599
service_class | 100
slot_count | 1
service_class_start_time | 2021-04-22 21:46:49.217
queue_start_time | 2021-04-22 21:46:49.21707
queue_end_time | 2021-04-22 21:46:49.21707
total_queue_time | 0
exec_start_time | 2021-04-22 21:46:49.217077
exec_end_time | 2021-04-22 21:46:53.762903
total_exec_time | 4545826
service_class_end_time | 2021-04-22 21:46:53.762903
final_state | Completed
est_peak_mem | 2097152
query_priority | Normal
service_class_name | Default queue
And from SVL_COMPILE, we have:
userid | xid | pid | query | segment | locus | starttime | endtime | compile
--------+----------+-------+-------+---------+-------+----------------------------+----------------------------+---------
100 | 14812605 | 30442 | 12599 | 0 | 1 | 2021-04-22 21:46:49.218872 | 2021-04-22 21:46:53.744529 | 1
100 | 14812605 | 30442 | 12599 | 2 | 2 | 2021-04-22 21:46:53.745711 | 2021-04-22 21:46:53.745728 | 0
100 | 14812605 | 30442 | 12599 | 3 | 2 | 2021-04-22 21:46:53.761989 | 2021-04-22 21:46:53.762015 | 0
100 | 14812605 | 30442 | 12599 | 1 | 1 | 2021-04-22 21:46:53.745476 | 2021-04-22 21:46:53.745503 | 0
(4 rows)
It shows that compile took from 21:46:49.218872 to 2021-04-22 21:46:53.744529, i.e. the overwhelming majority of the 4545ms total exec time.
There's a lot that could be taking up this time. Looking at more of the query and queuing statistics will help track down what is happening. Here are a few possibilities that I've seen be significant in the past:
Data return time. Your query is an open select and could be returning a meaningful amount of data; moving this over the network to the requesting computer takes time.
Queuing delays. What else is happening on your cluster? Does your query start right away, or does it need to wait for a slot?
Locking delays. What else is happening on your cluster? Are data/tables changing? Is the data your query needs being committed elsewhere?
Compile time. Is this the first time this query is run?
Is the table external, i.e. in S3 as an external table? Or are you using the new RA3 instance type, where all the source data is in S3? (I'm guessing you are not on RA3 nodes, but it doesn't hurt to ask.)
A place to start is STL_WLM_QUERY to see where the query is spending this extra time.
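For reference, the kind of lookup described here, and which produced the output shown in the question's edit, looks roughly like this (query id 12599 is taken from that output; substitute your own):

-- Where did the query spend its time: queued or executing?
SELECT query, total_queue_time, total_exec_time,
       exec_start_time, exec_end_time
FROM stl_wlm_query
WHERE query = 12599;

-- How much of the execution window was compilation, per segment?
SELECT query, segment, compile, starttime, endtime
FROM svl_compile
WHERE query = 12599
ORDER BY starttime;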

Group staggered records that are separated by a small time difference

Difficult question to title, but I am trying to replicate what social media or notification feeds do where they batch recent events so they can display “sequences” of actions. For example, if these are "like" records, in reverse chronological order:
like_id | user_id | like_timestamp
--------------------------------
1 | bob | 12:30:00
2 | bob | 12:29:00
3 | jane | 12:27:00
4 | bob | 12:26:00
5 | jane | 12:24:00
6 | jane | 12:23:00
7 | scott | 12:22:00
8 | bob | 12:20:00
9 | alice | 12:19:00
10 | scott | 12:18:00
I would like to group them such that I get the last 3 "bursts" of user likes, grouped (partitioned?) by user. If the "burst" rule is that likes less than 5 minutes apart belong to the same burst, then we would get:
user_id | num_likes | burst_start | burst_end
----------------------------------------------
bob | 3 | 12:26:00 | 12:30:00
jane | 3 | 12:23:00 | 12:27:00
scott | 2 | 12:18:00 | 12:22:00
alice's like does not get counted because it's part of the 4th most recent batch, and like 8 does not get added to bob's tally because it is 6 minutes before the next one.
I've tried keeping track of bursts with postgres' lag function, which lets me mark start and end events, but since like events can be staggered, I have no way of tying a like back to its "originator" (for example, tying id 4 back to 2).
Is this grouping possible? If so, is it possible to keep track of the start and end timestamp of each burst?
step-by-step demo: db<>fiddle
WITH group_ids AS (                                -- 1: one ordered group_id per user
    SELECT DISTINCT
        user_id,
        first_value(like_id) OVER (PARTITION BY user_id ORDER BY like_id) AS group_id
    FROM
        likes
    ORDER BY group_id                              -- keep only the three most recent users
    LIMIT 3
)
SELECT
    user_id,
    COUNT(*) AS num_likes,
    burst_start,
    burst_end
FROM (
    SELECT
        user_id,
        -- 4: first and last timestamp of each user's burst
        first_value(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id)      AS burst_end,
        first_value(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id DESC) AS burst_start
    FROM (
        SELECT
            l.*, gi.group_id,
            -- 2: gap to the previous (more recent) like of the same user
            lag(like_timestamp) OVER (PARTITION BY group_id ORDER BY like_id) - like_timestamp AS diff
        FROM
            likes l
        JOIN
            group_ids gi ON l.user_id = gi.user_id
    ) s
    WHERE diff IS NULL OR diff <= '00:05:00'       -- 3: drop likes more than 5 minutes from the previous one
) s
GROUP BY user_id, burst_start, burst_end           -- 5
The CTE creates an ordered group_id per user_id, so the first user (here the most recent one, bob) gets the lowest group_id, the second user (jane) the second lowest, and so on. This makes it possible to work with all likes of a certain user within one partition. The step is necessary because you cannot simply order by user_id, which would bring alice to the top. The LIMIT 3 limits the whole query to the first three users.
After joining the calculated group_id for each user, the time differences are calculated using the lag() window function, which returns the previous value. So it can be used to easily calculate the difference between the current timestamp and the previous one. This happens only within each user's group.
After that, the likes that are too far away (more than 5 minutes from the previous one) can be removed using the calculated diff.
Then the highest and lowest timestamps can be calculated with the first_value() window function (in ascending and descending order). These mark your burst_start and burst_end.
Finally, you can group by user and count the records.
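For anyone who wants to try the query outside the fiddle, a minimal setup matching the sample data from the question could look like this (the column types are assumptions; in a real feed like_timestamp would be a full timestamp rather than a time):

CREATE TABLE likes (
    like_id        int PRIMARY KEY,
    user_id        text,
    like_timestamp time
);

INSERT INTO likes VALUES
    (1,  'bob',   '12:30:00'),
    (2,  'bob',   '12:29:00'),
    (3,  'jane',  '12:27:00'),
    (4,  'bob',   '12:26:00'),
    (5,  'jane',  '12:24:00'),
    (6,  'jane',  '12:23:00'),
    (7,  'scott', '12:22:00'),
    (8,  'bob',   '12:20:00'),
    (9,  'alice', '12:19:00'),
    (10, 'scott', '12:18:00');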

Graph in Grafana using Postgres Datasource with BIGINT column as time

I'm trying to construct a very simple graph showing how many visits I've got in some period of time (for example, for each 5 minutes).
I have Grafana v5.4.0 paired with Postgres v9.6, full of data.
My table below:
CREATE TABLE visit (
    id serial CONSTRAINT visit_primary_key PRIMARY KEY,
    user_credit_id INTEGER NOT NULL REFERENCES user_credit(id),
    visit_date bigint NOT NULL,
    visit_path varchar(128),
    method varchar(8) NOT NULL DEFAULT 'GET'
);
Here's some data in it:
id | user_credit_id | visit_date | visit_path | method
----+----------------+---------------+---------------------------------------------+--------
1 | 1 | 1550094818029 | / | GET
2 | 1 | 1550094949537 | /mortgage/restapi/credit/{userId}/decrement | POST
3 | 1 | 1550094968651 | /mortgage/restapi/credit/{userId}/decrement | POST
4 | 1 | 1550094988557 | /mortgage/restapi/credit/{userId}/decrement | POST
5 | 1 | 1550094990820 | /index/UGiBGp0V | GET
6 | 1 | 1550094990929 | / | GET
7 | 2 | 1550095986310 | / | GET
...
So I tried these three variants (actually, dozens of others too) with no success:
Solution A:
SELECT
    visit_date as "time",
    count(user_credit_id) AS "user_credit_id"
FROM visit
WHERE $__timeFilter(visit_date)
ORDER BY visit_date ASC
No data on graph. Error: pq: invalid input syntax for integer: "2019-02-14T13:16:50Z"
Solution B:
SELECT
    $__unixEpochFrom(visit_date),
    count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY user_credit_id
Series A:
SELECT
    $__time(visit_date/1000,10m,previous),
    count(user_credit_id) AS "user_credit_id A"
FROM visit
WHERE
    visit_date >= $__unixEpochFrom()::bigint*1000 and
    visit_date <= $__unixEpochTo()::bigint*1000
GROUP BY 1
ORDER BY 1
No data on graph. No Error..
Solution C:
SELECT
    $__timeGroup(visit_date, '1h'),
    count(user_credit_id) AS "user_credit_id"
FROM visit
GROUP BY time
ORDER BY time
No data on graph. Error: pq: function pg_catalog.date_part(unknown, bigint) does not exist
Could someone please help me sort out this simple problem? I think the query should be compact and simple, but the Grafana docs demoing its syntax and features confuse me slightly. Thanks in advance!
Use this query, which will work if visit_date is timestamptz:
SELECT
    $__timeGroupAlias(visit_date,5m,0),
    count(*) AS "count"
FROM visit
WHERE
    $__timeFilter(visit_date)
GROUP BY 1
ORDER BY 1
But your visit_date is bigint, so you need to convert it to a timestamp (probably with TO_TIMESTAMP()), or you will need to find another way to use it as bigint. Use the query inspector for debugging and you will see the SQL generated by Grafana.
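For example, the conversion hinted at here could look roughly like this (an untested sketch: visit_date stores milliseconds, so it is divided by 1000 before being passed to TO_TIMESTAMP()):

SELECT
    $__timeGroupAlias(TO_TIMESTAMP(visit_date/1000),5m),
    count(*) AS "count"
FROM visit
WHERE
    $__timeFilter(TO_TIMESTAMP(visit_date/1000))
GROUP BY 1
ORDER BY 1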
Jan Garaj, thanks a lot! I should admit that your snippet, and what was even more valuable, your advice to switch to SQL debugging, dramatically helped me make my "breakthrough".
So, the resulting query which solved my problem is below:
SELECT
    $__unixEpochGroup(visit_date/1000, '5m') AS "time",
    count(user_credit_id) AS "Total Visits"
FROM visit
WHERE
    '1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval BETWEEN
        $__timeFrom()::timestamp
        AND
        $__timeTo()::timestamp
GROUP BY 1
ORDER BY 1
Several comments to decipher all this Grafana magic:
Grafana has a limited DSL for making configurable graphs; this set of functions is converted into meaningful SQL (this is where seeing the "compiled" SQL helped me a lot, many thanks again).
To make my BIGINT column suitable for the predefined Grafana functions, we simply need to convert it to seconds since the UNIX epoch, i.e. just divide by 1000.
The WHERE clause is not as simple and predictable: Grafana's DSL works differently there, and simple division did not do the trick. I solved it by using other Grafana functions to get the FROM and TO points in time (the period for which the graph should be rendered), but these functions generate a timestamp type while our column is BIGINT. Thanks to Postgres we have plenty of conversion tools: '1970-01-01 00:00:00 GMT'::timestamp + ((visit_date/1000)::text)::interval turns a BIGINT value into a Postgres TIMESTAMP, which Grafana handles just fine.
P.S. If you don't mind I've changed my question text to be more precise and detailed.

Know which tables are affected by a connection

I want to know if there is a way to retrieve which tables are affected by requests made from a connection in PostgreSQL 9.5 or higher.
The purpose is to have the information in a form that lets me know which tables were affected, in which order, and in what way.
More precisely, something like this would suffice:
id | datetime | id_conn | id_query | table | action
---+----------+---------+----------+---------+-------
1 | ... | 2256 | 125 | user | select
2 | ... | 2256 | 125 | order | select
3 | ... | 2256 | 125 | product | select
(this will be the result of a select query from user join order join product).
I know I can retrieve id_conn through pg_stat_activity, and I can see if there is a running query, but I can't find a "history" of the queries.
The final purpose is to debug the database when incoherent data is inserted into a table (due to a lack of constraints). Knowing which connection did the insert will lead me to the faulty script (as I already have the script name linked to the connection id).
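Not a full answer, but one direction worth noting: PostgreSQL does not keep a per-connection query history on its own, but statement logging can record every data-modifying statement together with the backend PID, which can then be matched back to the connection; the logged statements show which tables each connection touched and in what order. A sketch using standard settings (requires superuser):

-- Log all INSERT/UPDATE/DELETE statements (use 'all' to include SELECTs)
ALTER SYSTEM SET log_statement = 'mod';
-- Prefix each log line with a timestamp and the backend process id
ALTER SYSTEM SET log_line_prefix = '%m [%p] ';
-- Apply without a restart
SELECT pg_reload_conf();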

Select distinct rows from MongoDB

How do you select distinct records in MongoDB? This is pretty basic db functionality, I believe, but I can't seem to find it anywhere else.
Suppose I have a table as follows
--------------------------
| Name | Age |
--------------------------
|John | 12 |
|Ben | 14 |
|Robert | 14 |
|Ron | 12 |
--------------------------
I would like to run something like SELECT DISTINCT age FROM names WHERE 1;
db.names.distinct('age')
Looks like there is a SQL mapping chart that I overlooked earlier.
Now is a good time to say that using a distinct selection isn't the best way to go about querying things. Either cache the list in another collection or keep your data set small.