LAG() Function To Get The Average REDSHIFT - amazon-redshift

I am trying to find the AVERAGE time users spend whilst going through registration within an app.
This is my dataset:
customer_id | app_event      | timestamp
------------+----------------+-----------------------
1           | OPEN_APP       | '2022-01-01 19:00:25'
1           | CLICK_REGISTER | '2022-01-01 19:00:30'
1           | ENTER_DETAILS  | '2022-01-01 19:00:40'
1           | CLOSE_APP      | '2022-01-01 19:00:50'
2           | OPEN_APP       | '2022-01-01 20:00:25'
2           | CLICK_REGISTER | '2022-01-01 20:00:26'
2           | ENTER_DETAILS  | '2022-01-01 20:00:27'
2           | CLOSE_APP      | '2022-01-01 20:00:28'
This is my query:
WITH cte AS (
    SELECT
        customer_id,
        app_event,
        timestamp AS ts,
        EXTRACT(EPOCH ((ts - lag(ts, 1) OVER (PARTITION BY customer_id, app_event ORDER BY ts ASC))) AS time_spent
    FROM table
    GROUP BY customer_id, app_event, timestamp
)
SELECT
    app_event,
    AVG(time_spent)
FROM cte
GROUP BY app_event
This is the outcome I want:
app_event      | time_spent
---------------+-----------
OPEN_APP       |
CLICK_REGISTER | 3
ENTER_DETAILS  | 5.5
CLOSE_APP      | 5.5

I see a few issues with your query. You haven't posted your exact error, so I'll work from what I can see.
First, EXTRACT() operates on a timestamp, but the difference between two timestamps is an interval. I think you want DATEDIFF(sec, ts1, ts2), which gives the time difference (in seconds) between the two timestamps.
Most importantly, you are partitioning by app_event, which makes LAG() consider only rows with the same event value, so you would get the gap between successive CLICK_REGISTER events, for example, rather than between a customer's consecutive events. You need to remove app_event from the partition list of the LAG() function.
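Putting those fixes together, and dropping the unneeded GROUP BY in the CTE, gives a sketch like the following (events is a stand-in table name, since table is a reserved word; the cast on AVG() is there because Redshift's AVG() of an integer returns an integer):
WITH cte AS (
    SELECT
        customer_id,
        app_event,
        -- seconds between this event and the customer's previous event
        DATEDIFF(sec, LAG(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp), timestamp) AS time_spent
    FROM events
)
SELECT
    app_event,
    -- cast so the average keeps its fractional part (e.g. 5.5)
    AVG(time_spent::float) AS time_spent
FROM cte
GROUP BY app_event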

Related

Postgres: How to change start day of week and apply it in date_part?

with partial as (
    select
        date_part('week', activated_at) as weekly,
        count(*) as count
    from vendors
    where activated_at notnull
    group by weekly
)
This query counts the number of vendors activating per week. I need to change the start day of the week from Monday to Saturday. Similar posts exist, like how to change the first day of the week in PostgreSQL or Making Postgres date_trunc() use a Sunday based week, but none explains how to embed this in the date_part function. I would like to know how to use this function in my query with the week starting on Saturday.
Thanks in advance.
Maybe a little bit overkill for that, but you can use some CTEs and window functions. First generate your intervals, starting with your first Saturday (e.g. 2018-01-06 00:00) and ending with your last day (2018-12-31). Then select your data, join it, and sum it; as a bonus you also get weeks with zero activations:
with temp_days as (
    -- one row per week: a = week start (Saturday), e = start of the next week
    select a as a,
           a + '7 days'::interval as e
    from generate_series('2018-01-06 00:00'::timestamp,
                         '2018-12-31 00:00', '7 day') as a
),
temp_data as (
    select 1 as counter,
           vendors.activated_at
    from vendors
    where activated_at notnull
),
temp_order as (
    -- left join keeps weeks with zero activations
    select *
    from temp_days
    left join temp_data on temp_data.activated_at between temp_days.a and temp_days.e
)
select distinct on (temp_order.a)
       temp_order.a,
       temp_order.e,
       coalesce(sum(temp_order.counter) over (partition by temp_order.a), 0) as result
from temp_order
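If all you need is the week number itself, a simpler trick (a sketch, not from the original answer) is to shift each timestamp forward by two days, so that Saturday lands on the Monday boundary that date_part('week', ...) already uses:
select
    -- Sat..Fri map onto Mon..Sun of the same ISO week,
    -- so weeks effectively start on Saturday
    date_part('week', activated_at + interval '2 days') as weekly,
    count(*) as count
from vendors
where activated_at notnull
group by weekly
Note the usual ISO week-numbering quirks around year boundaries still apply.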

How to group by month off a timestamp field from Redshift in Superset

I am trying to show a monthly trend in Superset from a table which has a timestamp field called created_at, but I have no idea how to get it right.
The SQL query generated from this is the following:
SELECT
DATE_TRUNC('month', created_at) AT TIME ZONE 'UTC' AS __timestamp,
SUM(cost) AS "SUM(cost)"
FROM xxxx_from_redshift
WHERE created_at >= '2017-01-01 00:00:00'
AND created_at <= '2018-07-25 20:42:13'
GROUP BY DATE_TRUNC('month', created_at) AT TIME ZONE 'UTC'
ORDER BY "SUM(cost)" DESC
LIMIT 50000;
Like I mentioned above, I don't know how to make this work. My second question is: why is ORDER BY using SUM(cost)? If this is a time series, shouldn't it use ORDER BY 1 instead? I tried to change Sort By, but to no avail.
It is quite silly but I found that SUM(cost) doesn't work while sum(cost) works. It is a bug in Superset and will be addressed in https://github.com/apache/incubator-superset/pull/5487
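Until that fix lands, the workaround is to use the lowercase sum(cost) as your metric, so the generated query looks something like this (a sketch; the alias follows whatever label Superset assigns):
SELECT
    DATE_TRUNC('month', created_at) AT TIME ZONE 'UTC' AS __timestamp,
    sum(cost) AS "sum(cost)"
FROM xxxx_from_redshift
WHERE created_at >= '2017-01-01 00:00:00'
    AND created_at <= '2018-07-25 20:42:13'
GROUP BY DATE_TRUNC('month', created_at) AT TIME ZONE 'UTC'
ORDER BY "sum(cost)" DESC
LIMIT 50000;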

PostgreSQL row diff timestamp, and calculate stddev for group

I have a table with an ID column called mmsi and another column of timestamp, with multiple timestamps per mmsi.
For each mmsi I want to calculate the standard deviation of the difference between consecutive timestamps.
I'm not very experienced with SQL but have tried to construct a function as follows:
SELECT
mmsi, stddev(time_diff)
FROM
(SELECT mmsi,
EXTRACT(EPOCH FROM (timestamp - lag(timestamp) OVER (ORDER BY mmsi ASC, timestamp ASC)))
FROM ais_messages.ais_static
ORDER BY mmsi ASC, timestamp ASC) AS time_diff
WHERE time_diff IS NOT NULL
GROUP BY mmsi;
Your query is on the right track, but it has several problems. You gave the subquery the alias time_diff, but the computed EXTRACT(...) column itself has no alias, so there is no time_diff column for the outer query to filter or aggregate on. The LAG() call is also missing a PARTITION BY, so differences would be computed across mmsi boundaries. Here is a corrected version:
SELECT
t.mmsi,
STDDEV(t.time_diff) AS std
FROM
(
SELECT
mmsi,
EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
(PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
FROM ais_messages.ais_static
ORDER BY mmsi, timestamp
) t
WHERE t.time_diff IS NOT NULL
GROUP BY t.mmsi
This approach should be fine, but there is one edge case where it might not behave as expected. If a given mmsi group has only one record, it will not appear in the result set at all: the LAG() calculation returns NULL for that single record, so it is filtered out.
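If you would rather have those single-record groups appear (with a NULL standard deviation), one option, sketched below, is simply to drop the WHERE clause: aggregates such as STDDEV ignore NULL inputs, so the NULL produced by LAG() does no harm.
SELECT
    t.mmsi,
    -- groups with one record now show up with std = NULL
    STDDEV(t.time_diff) AS std
FROM
(
    SELECT
        mmsi,
        EXTRACT(EPOCH FROM (timestamp - LAG(timestamp) OVER
            (PARTITION BY mmsi ORDER BY timestamp))) AS time_diff
    FROM ais_messages.ais_static
) t
GROUP BY t.mmsi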

Converting a MySQL GROUP BY to Postgres

I'm working out a query that I've run successfully in MySQL for a while, but in Postgres it's failing with the ol' -
ERROR: column "orders.created_at" must appear in the GROUP BY clause
or be used in an aggregate function
Here's the query:
SELECT SUM(total) AS total, to_char(created_at, 'YYYY/MM/DD') AS order_date
FROM orders
WHERE created_at >= (NOW() - INTERVAL '2 DAYS')
GROUP BY to_char(created_at, 'DD')
ORDER BY created_at ASC;
It's just supposed to return something like this:
total | order_date
---------+------------
1099.90 | 2013/01/15
650.00 | 2013/01/16
4399.00 | 2013/01/17
The main thing is I want the sum grouped by each individual day of the month.
Anyone have ideas?
UPDATE:
The reason I'm grouping by day is because the graph will be labeled with each day of the month, and the total sales for each.
1st - $3400.00
2nd - $2237.00
3rd - $1489.00
etc.
I'm not sure why you're doing a conversion there. I think the better thing to do would be this:
SELECT
SUM(total) AS total,
created_at::date AS order_date
FROM
orders
WHERE
created_at >= (NOW() - INTERVAL '2 DAYS')
GROUP BY
created_at::date
ORDER BY
created_at::date ASC;
I would recommend this query, and then format the daily labels in your graph through the graph settings, to ensure you do not have any weird issues where the same day in different months gets grouped together. However, to get what you display in your edit you can do this:
SELECT
SUM(total) AS total,
to_char(created_at, 'DDth') AS order_date
FROM
orders
WHERE
created_at >= (NOW() - INTERVAL '2 DAYS')
GROUP BY
to_char(created_at, 'DDth')
ORDER BY
to_char(created_at, 'DDth') ASC;
Here is the SQL you need in order to run this. The GROUP BY and ORDER BY need to contain the same expression.
SELECT SUM(total) AS total,
       to_char(created_at, 'YYYY/MM/DD') AS order_date
FROM orders
WHERE created_at >= (NOW() - INTERVAL '2 DAYS')
GROUP BY to_char(created_at, 'YYYY/MM/DD')
ORDER BY to_char(created_at, 'YYYY/MM/DD')
http://sqlfiddle.com/#!12/52d99/2
Hope this helps,
Matt

Select Data over time period

I'm a bit of a newbie when it comes to Postgres, so bear with me a wee bit and I'll see if I can put up enough information.
I insert weather data into a table every 10 mins, and I have a time column that is stamped with an epoch date.
I have a column of the last hour's rain fall, and every hour that number changes, of course, with the running total (for that hour).
What I would like to do is skim through the rows to the end of each hour, and get that row, but do it over the last 4 hours, so I would only be returning 4 rows, say.
Is this possible in one query? Or should I do multiple queries?
I would like to do this in one query, but I'm not fussed...
Thanks
Thanks guys for your answers. I was/am a bit confused by yours, Gavin - sorry :) It comes from not knowing this terribly well.
I'm still a bit unsure about this, so I'll try and explain it a bit better.
I have a C program that inserts data into the database every 10 mins. It reads the data from a device that keeps the last hour's rain fall, so every 10 mins it could go up by x amount.
So I guess I have 6 rows per hour of data.
My plan was to go back (in my PHP page) every 7th row, which would be the last entry for each hour, and just grab that value. Hence why I would only ever need 4 rows... just spaced out a bit!
My table (readings) has data like this
index | time (text) | last hrs rain fall (text)
1 | 1316069402 | 1.2
All ears to better ways of storing it too :) I very much appreciate your help, guys, thanks.
You should be able to do it in one query...
Would something along the lines of:
SELECT various_columns,
the_hour,
SUM ( column_to_be_summed )
FROM ( SELECT various_columns,
column_to_be_summed,
extract ( hour FROM TIME ) AS the_hour
FROM readings
WHERE TIME > ( NOW() - INTERVAL '4 hour' ) ) a
GROUP BY various_columns,
the_hour ;
do what you need?
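If you literally want the last reading in each hour, rather than an aggregate, DISTINCT ON is another option. A sketch, assuming the epoch value is stored as text in the time column as shown above:
SELECT DISTINCT ON (date_trunc('hour', r.ts))
       r.*
FROM ( -- convert the text epoch value to a proper timestamp first
       SELECT *, to_timestamp(time::bigint) AS ts
       FROM readings ) r
WHERE r.ts > NOW() - INTERVAL '4 hours'
-- within each hour bucket, keep only the latest row
ORDER BY date_trunc('hour', r.ts), r.ts DESC;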
SELECT SUM(rainfall) FROM weatherdata WHERE time > (NOW() - INTERVAL '4 hour' );
I don't know your column names, but that should do it; the ones in caps are pgsql types. Is that what you are after?
I am not sure if this is exactly what you are looking for but perhaps it may serve as a basis for adaptation.
I often have a requirement for producing summary data over time periods, though I don't use epoch time, so there may be better ways of manipulating the values than I have come up with.
Create and populate a test table:
create table epoch_t(etime numeric);
insert into epoch_t
select extract(epoch from generate_series(now(),now() - interval '6 hours',interval '-10 minutes'));
To divide up time into period buckets:
select generate_series(to_char(now(),'yyyy-mm-dd hh24:00:00')::timestamptz,
to_char(now(),'yyyy-mm-dd hh24:00:00')::timestamptz - interval '4 hours',
interval '-1 hour');
Convert epoch time to a Postgres timestamp:
select timestamptz 'epoch' + etime * '1 second'::interval from epoch_t;
Then truncate to the hour:
select to_char(timestamptz 'epoch' + etime * '1 second'::interval,
'yyyy-mm-dd hh24:00:00')::timestamptz from epoch_t
To provide summary information by hour:
select to_char(timestamptz 'epoch' + etime * '1 second'::interval,
'yyyy-mm-dd hh24:00:00')::timestamptz,
count(*)
from epoch_t
group by 1
order by 1 desc;
If you might have gaps in the data but need to report zero results, use generate_series to create period buckets and left join to the data table.
In this case I create sample hour buckets going back prior to the data population above (9 hours instead of 6) and join on the conversion of epoch time to a timestamp truncated to the hour.
select per.sample_hour,
sum(case etime is null when true then 0 else 1 end) as etcount
from (select generate_series(to_char(now(),
'yyyy-mm-dd hh24:00:00')::timestamptz,
to_char(now(),'yyyy-mm-dd hh24:00:00')::timestamptz - interval '9 hours',
interval '-1 hour') as sample_hour) as per
left join epoch_t on to_char(timestamptz 'epoch' + etime * '1 second'::interval,
'yyyy-mm-dd hh24:00:00')::timestamptz = per.sample_hour
group by per.sample_hour
order by per.sample_hour desc;
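As a side note, date_trunc() can replace the to_char()-and-cast round trip used above; a minimal equivalent of the hourly summary against the same test table:
-- same hourly counts, without formatting through text
select date_trunc('hour', timestamptz 'epoch' + etime * interval '1 second') as sample_hour,
       count(*)
from epoch_t
group by 1
order by 1 desc;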