Median of time interval in Amazon Redshift? - amazon-redshift

I'm trying to get the median time interval in a group by. My dataset is two columns: column1 is user_ids and column2 is a time interval with the time that user spent on a website. When I group by id and call the MEDIAN function, Redshift throws an error stating that "median(interval)" is not allowed. I left the other columns out of the description since they don't really matter.

Interval is not a Redshift native data type - you cannot have a column of type interval. However, Redshift does understand interval literals. Just convert your timestamp differences into seconds (or minutes or hours or days), run MEDIAN(), and then display the result as an interval (if that is what you want).
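For instance, a minimal sketch, assuming a hypothetical sessions table where the interval is stored as session_start and session_end timestamp columns:
SELECT user_id,
       MEDIAN(DATEDIFF(second, session_start, session_end)) AS median_seconds
FROM sessions
GROUP BY user_id;
The median comes back in whole seconds, which you can then present as an interval for display if desired.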

Related

Get truncated data from a table - postgresql

I want to get truncated data over the last month. My time is in Unix timestamps and I need to get data from the last 30 days for each specific day.
The data is in the following form:
{
    "id": "648637",
    "exchange_name": "BYBIT",
    "exchange_icon_url": "https://cryptowisdom.com.au/wp-content/uploads/2022/01/Bybit-colored-logo.png",
    "trade_time": "1675262081986",
    "price_in_quote_asset": 23057.5,
    "price_in_usd": 1,
    "trade_value": 60180.075,
    "base_asset_icon": "https://assets.coingecko.com/coins/images/1/large/bitcoin.png?1547033579",
    "qty": 2.61,
    "quoteqty": 60180.075,
    "is_buyer_maker": true,
    "pair": "BTCUSDT",
    "base_asset_trade": "BTC",
    "quote_asset_trade": "USDT"
}
I need to truncate the data based on trade_time.
How do I write the query?
The secret sauce is the date_trunc function, which takes a timestamp with time zone and truncates it to a specific precision (hour, day, week, etc). You can then group based on this value.
In your case we need to convert these JavaScript-style Unix timestamps (milliseconds since the epoch) to timestamp with time zone first, which we can do with to_timestamp, but it's still a fairly simple query.
SELECT
    date_trunc('day', to_timestamp(trade_time / 1000.0)),
    COUNT(1)
FROM pings_raw
GROUP BY date_trunc('day', to_timestamp(trade_time / 1000.0))
Another approach would be to leave everything as numbers, which might be marginally faster, though I find it less readable:
SELECT
    (trade_time/(1000*60*60*24))::int * (1000*60*60*24),
    COUNT(1)
FROM pings_raw
GROUP BY (trade_time/(1000*60*60*24))::int

Subtract Additional Time from $__timeFilter

I want to subtract additional time in $__timeFilter in Grafana. For example, if I have selected "Last 7 days", I want to run 2 queries which do a comparison: one query gives me avg CPU utilization for the last 7 days and another gives me avg CPU utilization for now() - 14d to now() - 7d. And this is dynamic; I can get it for 6 hrs, 2 days or anything else selected.
My database is TimescaleDB and the Grafana version is 8.3.5.
Edit
The query is:
select avg(cpu) from cpu_utilization where $__timeFilter(timestamp)
Whatever is selected in the time filter in Grafana, the query is manipulated accordingly.
When Grafana processes it, the query becomes the following if I select the last 24 hrs:
select avg(cpu) from cpu_utilization where timestamp BETWEEN '2022-09-07 05:32:10' and '2022-09-08 05:32:10'
This is normal behaviour. Now I want that, if I select the last 24 hrs, this query behaves as it does now, but that an additional query also runs:
select avg(cpu) from cpu_utilization where timestamp BETWEEN '2022-09-06 05:32:10' and '2022-09-07 05:32:10'
(I don't want this just for the last 24 hrs, but for any relative time period selected in the filter.)
Answer: https://stackoverflow.com/a/73658919/14817486
You can use the global variables $__to and $__from.
For example, ${__from:date:seconds} will give you a timestamp in seconds. You can then subtract 7 days (= 604800 seconds) from it and use it in your query's WHERE clause. Depending on your SQL dialect, that might be by using TIMESTAMP(), TO_TIMESTAMP() or something similar. So it would look similar to this:
[...] WHERE timestamp BETWEEN TO_TIMESTAMP(${__from:date:seconds}-604800) AND TO_TIMESTAMP(${__to:date:seconds}-604800) [...]
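Combined with the query from the question, a sketch of the shifted query might look like this (keeping the answer's fixed 7-day offset):
select avg(cpu)
from cpu_utilization
-- previous period: the selected window shifted back by 7 days (604800 s)
where timestamp between to_timestamp(${__from:date:seconds} - 604800)
                    and to_timestamp(${__to:date:seconds} - 604800)
If the offset should track whatever window is selected rather than a fixed 7 days, ${__to:date:seconds} - ${__from:date:seconds} could be used as the amount to subtract instead.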
Interesting question! If I understood correctly, you could use the timestamp column as the reference, since Grafana is already filtering by it, to build the comparison query. So you can get min(timestamp) and max(timestamp) to know the limits of your period and then build from them.
For example, min(timestamp) - INTERVAL '7 days' would give you the start of the previous range, and max(timestamp) - INTERVAL '7 days' would give you its end.
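A sketch of that idea, reusing the cpu_utilization query from the question (the subqueries find the limits of the Grafana-filtered window and shift them back by 7 days):
select avg(cpu)
from cpu_utilization
where timestamp between
    (select min(timestamp) - interval '7 days'
       from cpu_utilization where $__timeFilter(timestamp))
    and
    (select max(timestamp) - interval '7 days'
       from cpu_utilization where $__timeFilter(timestamp))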

Azure Data Factory - calculate the number of days between two dates

I have to calculate the number of days between two dates, and I searched but couldn't find any similar function available in ADF.
What I've noticed so far is that if I want to get the number of days between 2 columns, the columns must be date columns, but I have timestamp columns (date + time).
How can I transform these columns into date columns? Or do you have another idea?
Using the fact that 86,400 is the number of seconds in a day, and the ticks function, which returns the ticks property value for a specified timestamp (a tick is a 100-nanosecond interval), a day is 86,400 × 10,000,000 = 864,000,000,000 ticks:
#string(div(sub(ticks(last_date),ticks(first_date)),864000000000))
You can re-format any type of timestamp using the function formatDateTime():
#formatDateTime(your_time_stamp,'yyyy-MM-dd HH:mm:ss')
Example:
#string(div(sub(ticks('2022-02-23 15:58:16'),ticks('2022-01-31 15:58:16')),864000000000))
This is the expression that I used for Data Flow.
toDate(toString({max_po create date},'yyyy-MM-dd')) - toDate(toString(max_datetimetoday,'yyyy-MM-dd'))
max_po create date and max_datetimetoday are timestamp (date + time) columns.
The result is in days.

Get the last timestamps in a group by time query in InfluxDB

I have an InfluxDB database with a measurement of prices, with timestamps in nanoseconds. When I do a select grouped by time like this one:
select first(price),last(price) from priceseries where time>=1496815212834974866 and time<=1496865599580302882 group by time(1s)
I receive a time column in which the timestamps are aligned to the second that begins the group. For example, the timestamp will be 08:00:00 and the next timestamp will be 08:00:01.
How can I apply an aggregation function to the record timestamps themselves, like last(time) or first(time), so as to get the real first and last timestamps of the group (I can have many prices within my group)?
And how can the time column in the response be the closing second rather than the opening second? That is, if the group goes from 08:00:00 to 08:00:01, I want to see 08:00:01 in my time column instead of 08:00:00, which I see now.
Not when using an aggregation function, which implies use of group by.
select first(price), last(price) from priceseries where time >= <..> and time <= <..> will give you the first and last price within that time window.
When the query has a group by, the aggregation applies only to values within the intervals. The values themselves are the real values that fall in the 08:00:00 - 08:00:01 interval, it's just that the timestamp shown is for the interval itself, not the actual values.
Meaning that the query for between 08:00:00 and 08:00:01 without a group by and the query with a group by time(1s) for the same period will give the same result. The only difference is that the query without group by will have the value's actual timestamp, and the group by query will have the interval's timestamp instead.
The timestamp when using group by indicates the starting time of the interval. From that, you can calculate the end time as start time + interval. Which timestamp to show is not configurable in the query language.
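To illustrate the difference, here is a sketch with a hypothetical one-second window (times chosen arbitrarily within the question's range):
-- without group by: per the answer, the row carries the values' actual timestamp
select first(price), last(price) from priceseries
  where time >= '2017-06-07T08:00:00Z' and time < '2017-06-07T08:00:01Z'
-- with group by time(1s): same values, but the time column shows 08:00:00,
-- the start of the interval; the end is start + interval (08:00:01)
select first(price), last(price) from priceseries
  where time >= '2017-06-07T08:00:00Z' and time < '2017-06-07T08:00:01Z'
  group by time(1s)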

Web analytics schema with Postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, but only aggregated data every 5 seconds:
time  country  browser  num_visits
==================================
0     USA      Chrome   12
0     USA      IE       7
5     France   IE       5
As you can see, every 5 seconds I insert multiple rows (one per dimension combination).
In order to reduce the number of rows that need to be scanned by queries, I am thinking of having multiple tables with the above schema based on their resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when a user asks about the last day, I will go to the hour-resolution table, which is smaller than the 5-second-resolution table (although I could have used that one too; it's just more rows to scan).
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries to the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on. AFAIU I have traded one query against a huge table (that has many relevant rows to scan) for multiple queries against medium tables plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?
Now what if the hour-resolution table has data on hours 0, 1, 2, 3, ... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries to the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, and so on.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
You could use inheritance/partitioning: one master table and many hourly-resolution child tables (and, perhaps, minute- and second-resolution child tables).
Thus you only have to select from the master table, and let the constraint on each child table decide which is which.
Of course, you have to add a trigger function to route inserts into the appropriate child tables.
It's complexity at insert time versus complexity at display time.
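A minimal sketch of such a routing trigger, assuming hypothetical names: stats_master as the inheritance parent, stats_1hour and stats_5second as child tables, and a hypothetical resolution column acting as the discriminator (the children would also carry matching CHECK constraints so the planner can skip them on select):
CREATE OR REPLACE FUNCTION stats_insert_router() RETURNS trigger AS $$
BEGIN
    -- route each row to the child table matching its resolution
    IF NEW.resolution = '1 hour' THEN
        INSERT INTO stats_1hour VALUES (NEW.*);
    ELSE
        INSERT INTO stats_5second VALUES (NEW.*);
    END IF;
    RETURN NULL;  -- keep the row out of the master table itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER stats_insert_router_trg
    BEFORE INSERT ON stats_master
    FOR EACH ROW EXECUTE PROCEDURE stats_insert_router();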
PostgreSQL - View or Partitioning?