KDB: select only rows with max value on a column, elegantly

I have this table for stock prices (simplified version here):
+----------+--------+-------+
|   Time   | Ticker | Price |
+----------+--------+-------+
| 10:00:00 | A      |     5 |
| 10:00:01 | A      |     6 |
| 10:00:00 | B      |     3 |
+----------+--------+-------+
I want to select, for each Ticker, the row with the maximum Time, e.g.
+----------+--------+-------+
|   Time   | Ticker | Price |
+----------+--------+-------+
| 10:00:01 | A      |     6 |
| 10:00:00 | B      |     3 |
+----------+--------+-------+
I know how to do it in SQL (a similar question can be found here), but I have no idea how to do it elegantly in KDB.
I have a solution that does the selection twice:
select first Time, first Ticker, first Price by Ticker from (`Time xdesc select Time, Ticker, Price from table where date=2018.06.21)
Is there a cleaner solution?

Whenever you're doing a double select involving a by, it's a good sign that you can instead use fby:
q)t:([]time:10:00:00 10:00:01 10:00:00;ticker:`A`A`B;price:5 6 3)
q)select from t where time=(max;time) fby ticker
time     ticker price
---------------------
10:00:01 A      6
10:00:00 B      3
Kdb also offers a shortcut of taking the last record per group whenever you do a select by with no specified columns, but this approach isn't as general or customizable:
q)select by ticker from t
ticker| time     price
------| --------------
A     | 10:00:01 6
B     | 10:00:00 3

One additional thing to note: select by can give wrong results if the data is not sorted correctly.
e.g.
select by ticker from reverse[t]
ticker| time     price
------| --------------
A     | 10:00:00 5     / wrong result
B     | 10:00:00 3
The fby approach gives the correct results regardless of the order:
select from (reverse t) where time=(max;time) fby ticker
time     ticker price
---------------------
10:00:00 B      3
10:00:01 A      6

Related

get monthly and weekly average working hours in postgresql

I have a table that has the following columns: local_id | time_in | time_out | date | employee_id
I have to calculate average working hours (calculated from time_out and time_in) on a monthly basis in PostgreSQL. I have no clue how to do that; I was thinking about using the date_part function...
here are the table details:
local_id | time_in  | time_out | date       | employee_id
---------+----------+----------+------------+------------
       7 | 08:00:00 | 17:00:00 | 2020-02-12 |           2
       6 | 08:00:00 | 17:00:00 | 2020-02-12 |           4
       8 | 09:00:00 | 17:00:00 | 2020-02-12 |           3
      13 | 08:05:00 | 17:00:00 | 2020-02-17 |           3
      12 | 08:00:00 | 18:09:00 | 2020-02-13 |           2
Demo: db<>fiddle (extended example covering two months)
SELECT
    employee_id,
    date_trunc('month', the_date) AS month,   -- 1
    AVG(time_out - time_in)                   -- 2, 3
FROM
    mytable
GROUP BY employee_id, month                   -- 3
date_trunc() "shortens" the date to a certain date part. In this case, all dates are truncated to the month, which makes it possible to group by month (for your "monthly basis").
Calculate the working time as the difference of the two times (time_out - time_in).
Group by employee_id and the calculated month, taking the average of the time differences.
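The title also asks for weekly averages. date_trunc() accepts 'week' as well (truncating to the Monday of the ISO week), so the same pattern covers that case; a minimal sketch, using the same assumed names (mytable, the_date) as above:
SELECT
    employee_id,
    date_trunc('week', the_date) AS week,
    AVG(time_out - time_in) AS avg_working_time
FROM
    mytable
GROUP BY employee_id, week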

How to check if there's a record in every hour in specified time-frame and then count it?

I'm using PostgreSQL and this is my table measurement_archive:
+-----------+------------------------+------+-------+
| sensor_id | time                   | type | value |
+-----------+------------------------+------+-------+
| 123       | 2017-11-26 01:53:11+00 | PM25 | 34.32 |
| 123       | 2017-11-26 02:15:11+00 | PM25 | 32.1  |
| 123       | 2017-11-26 04:32:11+00 | PM25 | 75.3  |
+-----------+------------------------+------+-------+
I need a query that will take records from a specified timeframe (e.g. from 2017-01-01 00:00:00 to 2017-12-01 23:59:59) and then check whether every hour contains at least 1 record; for each hour that does, add 1 to the result.
So, if I run that query from 2017-11-26 01:00:00 to 2017-11-26 04:59:59+00 for sensor_id = 123 on the above table, the result should be 3.
select count(*)
from (
    select date_trunc('hour', time) as time
    from measurement_archive
    where time >= '2017-11-26 01:00:00'
      and time < '2017-11-26 05:00:00'
      and sensor_id = 123
    group by 1
) s
An alternative solution would be using distinct:
select count(*)
from (
    select distinct extract(hour from time)
    from measurement_archive
    where time >= '2017-11-26 01:00:00'
      and time < '2017-11-26 05:00:00'
      and sensor_id = 123
) t;
(Note this relies on the timeframe staying within a single day, since only the hour is kept distinct.)
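A more compact variant folds the distinct into the aggregate itself. This is a sketch against the same table; unlike extracting only the hour, date_trunc('hour', ...) keeps the date, so it also stays correct for timeframes spanning several days:
select count(distinct date_trunc('hour', time))
from measurement_archive
where time >= '2017-11-26 01:00:00'
  and time < '2017-11-26 05:00:00'
  and sensor_id = 123;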

Use another table's data in postgreSQL

I have an event table and a transaction log, and I want to compute each event's total revenue in one SQL query.
Could anyone tell me how to do this?
Please be aware there will be more than 100,000 rows in the transaction log.
event_table:
Event_id | start_date | end_date
---------+------------+-----------
11111    | 2013-01-04 | 2013-01-05
11112    | 2013-01-08 | 2013-01-10
11113    | 2013-01-11 | 2013-01-12
11114    | 2013-01-15 | 2013-01-18
11115    | 2013-01-19 | 2013-01-21
11116    | 2013-01-22 | 2013-01-24
11117    | 2013-01-26 | 2013-01-29
transaction_log:
id | name    | time_created | Cost
---+---------+--------------+-----
1  | michael | 2013-01-04   | 1
2  | michael | 2013-01-08   | 4
3  | mary    | 2013-01-11   | 5
4  | john    | 2013-01-15   | 2
5  | michael | 2013-01-19   | 3
6  | mary    | 2013-01-22   | 2
7  | john    | 2013-01-26   | 4
I tried to use SQL like the following, but it does not work:
select
    event_table.id,
    (select sum(Cost)
     from transaction_log
     where date(time_created) between transaction_log.start_date and transaction_log.end_date) as revenue
from event_table
It is failing because the fields start_date and end_date belong to event_table, but you're referencing them as transaction_log.start_date and transaction_log.end_date (note also that, per your schema, the id column is called Event_id). This will work:
select
    event_table.Event_id,
    (select sum(Cost)
     from transaction_log
     where date(time_created) between event_table.start_date and event_table.end_date) as revenue
from event_table
There is no need to cast time_created to a date (date(time_created)) if it is already of the date data type. Otherwise, if time_created is timestamp or timestamptz, then for performance you may want to consider doing:
select
    event_table.Event_id,
    (select sum(Cost)
     from transaction_log
     where time_created >= event_table.start_date::timestamptz
       and time_created < (event_table.end_date + 1)::timestamptz) as revenue
from event_table
Also for performance: when executing a query like the one above, PostgreSQL runs the subquery once for each row of the main query (in this case, each row of event_table). Joining and using GROUP BY will generally give better results:
select e.Event_id, sum(l.Cost) as revenue
from event_table e
join transaction_log l on l.time_created between e.start_date and e.end_date
group by e.Event_id
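One caveat: the inner join drops events with no matching transactions entirely, while the correlated subquery returns them with a NULL revenue. If such events should show up with zero revenue instead, a left join with coalesce is a possible sketch:
select e.Event_id, coalesce(sum(l.Cost), 0) as revenue
from event_table e
left join transaction_log l on l.time_created between e.start_date and e.end_date
group by e.Event_id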

Join column with timestamps where value is maximum

I have a table that looks like
+-------+-----------+
| value | timestamp |
+-------+-----------+
and I'm trying to build a query that gives a result like
+-------+-----------+------------+------------------------+
| value | timestamp | MAX(value) | timestamp of max value |
+-------+-----------+------------+------------------------+
so that the result looks like
+---+----------+---+----------+
| 1 | 1.2.1001 | 3 | 1.1.1000 |
| 2 | 5.5.1021 | 3 | 1.1.1000 |
| 3 | 1.1.1000 | 3 | 1.1.1000 |
+---+----------+---+----------+
but I got stuck on joining the column with the corresponding timestamps.
Any hints or suggestions?
Thanks in advance!
For further information (if that helps):
In the real project, the max values are grouped by month and day (with a group by clause, which works, btw), but somehow I got stuck on joining the timestamps for the max values.
EDIT
Cross joins are a good idea, but I want to have them grouped by month e.g.:
+---+----------+---+----------+
| 1 | 1.1.1101 | 6 | 1.1.1300 |
| 2 | 2.6.1021 | 5 | 5.6.1000 |
| 3 | 1.1.1200 | 6 | 1.1.1300 |
| 4 | 1.1.1040 | 6 | 1.1.1300 |
| 5 | 5.6.1000 | 5 | 5.6.1000 |
| 6 | 1.1.1300 | 6 | 1.1.1300 |
+---+----------+---+----------+
EDIT 2
I've added a fiddle with some sample data and an example of the current query.
http://sqlfiddle.com/#!1/efa42/1
How to add the corresponding timestamp to the maximum?
Try a cross join with two subqueries: the first one selects all records, the second one gets the single row that represents the timestamp of the max value, e.g. <3;"1000-01-01">.
SELECT col_value, col_timestamp, max_col_value, col_timestamp_of_max_value
FROM table1
CROSS JOIN (
    select max(col_value) max_col_value, col_timestamp col_timestamp_of_max_value
    from table1
    group by col_timestamp
    order by max_col_value desc
    limit 1
) A -- one row that represents the timestamp of the max value, i.e. <3;"1000-01-01">
Use a window function, since you're using PostgreSQL:
SELECT *, max(value) OVER (), max(timestamp) OVER () FROM table
That gives you the max value and the max timestamp over all rows, repeated on every row.
http://www.postgresql.org/docs/9.1/static/tutorial-window.html
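Note that max(timestamp) over () is the latest timestamp overall, not the timestamp belonging to the max value. To get the latter with a window function, first_value() can be used; a sketch, assuming the question's table1 with columns value and timestamp, partitioned by month as requested in the edit (drop the PARTITION BY clauses for a global max):
SELECT value,
       timestamp,
       max(value) OVER (PARTITION BY date_trunc('month', timestamp)) AS max_value,
       first_value(timestamp) OVER (PARTITION BY date_trunc('month', timestamp)
                                    ORDER BY value DESC) AS timestamp_of_max_value
FROM table1;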

Temporal Aggregation in PostgreSQL

I am working on a Java implementation for temporal aggregation using a PostgreSQL database.
My table looks like this:
Value | Start      | Stop
(int) | (Date)     | (Date)
------+------------+------------
1     | 2004-01-01 | 2010-01-01
4     | 2000-01-01 | 2008-01-01
So to visualize these periods:
                    ------------------------------    (value 1)
----------------------------------------              (value 4)
2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010
[        4         ][      5=4+1       ][   1    ]
My algorithm now calculates temporal aggregations of the data, e.g. SUM():
Value | Start      | Stop
------+------------+------------
4     | 2000-01-01 | 2004-01-01
5     | 2004-01-01 | 2008-01-01
1     | 2008-01-01 | 2010-01-01
In order to test the results I obtained, I would now like to query the data directly using PostgreSQL. I know that there is no easy way to solve this problem yet; however, there surely is a way to get the same results. The aggregations Count, Max, Min, Sum and Average should be supported. I do not mind a bad or slow solution; it just has to work.
A query I found so far which should work similarly is the following:
select count(*), ts, te
from ( checkout a normalize checkout b using() ) checkoutNorm
group by ts, te;
My adoption looks like this:
select count(*), start, stop
from ( myTable a normalize myTable b using() ) myTableNorm
group by start, stop;
However, an error was reported: ERROR: syntax error at or near "normalize" -- LINE 2: from ( ndbs_10 a normalize ndbs_10 b using() ) ndbsNorm.
Does anyone have a solution to this problem? It does not have to be based on the above query, as long as it works. Thanks a lot.
Your question was really hard to understand, but I think I figured it out.
You want a running sum over value. Values are only applicable between the start and stop of their time period, so they have to be added at the beginning of that period and deducted at the end.
In addition, you want the begin and end of the resulting period the sum is valid for.
That should do it:
-- DROP SCHEMA x CASCADE;
CREATE SCHEMA x;
CREATE TABLE x.tbl(val int, start date, stop date);

INSERT INTO x.tbl VALUES
 (4, '2000-01-01', '2008-01-01')
,(7, '2001-01-01', '2009-01-01')
,(1, '2004-01-01', '2010-01-01')
,(2, '2005-01-01', '2006-01-01');

WITH a AS (
    SELECT start AS ts, val FROM x.tbl
    UNION ALL
    SELECT stop, val * (-1) FROM x.tbl
    ORDER BY 1, 2)
SELECT sum(val) OVER w AS val_sum
      ,ts AS start
      ,lead(ts) OVER w AS stop
FROM a
WINDOW w AS (ORDER BY ts)
ORDER BY ts;
val_sum | start      | stop
--------+------------+------------
      4 | 2000-01-01 | 2001-01-01
     11 | 2001-01-01 | 2004-01-01
     12 | 2004-01-01 | 2005-01-01
     14 | 2005-01-01 | 2006-01-01
     12 | 2006-01-01 | 2008-01-01
      8 | 2008-01-01 | 2009-01-01
      1 | 2009-01-01 | 2010-01-01
      0 | 2010-01-01 |
Edit after request
For all requested aggregate functions:
SELECT period
      ,val_sum
      ,val_count
      ,val_sum::float / val_count AS val_avg
      ,(SELECT min(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_min
      ,(SELECT max(val) FROM x.tbl WHERE start < y.stop AND stop > y.start) AS val_max
      ,start
      ,stop
FROM (
    WITH a AS (
        SELECT start AS ts, val, 1 AS c FROM x.tbl
        UNION ALL
        SELECT stop, val, -1 FROM x.tbl
        ORDER BY 1, 2)
    SELECT count(*)   OVER w AS period
          ,sum(val*c) OVER w AS val_sum
          ,sum(c)     OVER w AS val_count
          ,ts AS start
          ,lead(ts)   OVER w AS stop
    FROM a
    WINDOW w AS (ORDER BY ts)
    ORDER BY ts
) y
WHERE stop IS NOT NULL;
period | val_sum | val_count | val_avg | val_min | val_max | start      | stop
-------+---------+-----------+---------+---------+---------+------------+------------
     1 |       4 |         1 |       4 |       4 |       4 | 2000-01-01 | 2001-01-01
     2 |      11 |         2 |     5.5 |       4 |       7 | 2001-01-01 | 2004-01-01
     3 |      12 |         3 |       4 |       1 |       7 | 2004-01-01 | 2005-01-01
     4 |      14 |         4 |     3.5 |       1 |       7 | 2005-01-01 | 2006-01-01
     5 |      12 |         3 |       4 |       1 |       7 | 2006-01-01 | 2008-01-01
     6 |       8 |         2 |       4 |       1 |       7 | 2008-01-01 | 2009-01-01
     7 |       1 |         1 |       1 |       1 |       1 | 2009-01-01 | 2010-01-01
min() and max() could possibly be optimized, but that should be good enough.
CTEs (WITH clauses) and subqueries are exchangeable, as you can see.
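To illustrate that exchangeability, here is a sketch of the second query restructured with the inner WITH hoisted to the top level as chained CTEs (the min/max correlated subqueries are left out for brevity):
WITH a AS (
    SELECT start AS ts, val, 1 AS c FROM x.tbl
    UNION ALL
    SELECT stop, val, -1 FROM x.tbl
), y AS (
    SELECT count(*)   OVER w AS period
          ,sum(val*c) OVER w AS val_sum
          ,sum(c)     OVER w AS val_count
          ,ts AS start
          ,lead(ts)   OVER w AS stop
    FROM a
    WINDOW w AS (ORDER BY ts)
)
SELECT period, val_sum, val_count
      ,val_sum::float / val_count AS val_avg
      ,start, stop
FROM y
WHERE stop IS NOT NULL
ORDER BY start;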