asof (aj) join strictly less than in KDB/Q

I have a quote table and trade table, and would like to list the quotes table and join in the trades table matching on timestamps strictly less than the timestamp of the trade.
For example:
q:([]time:10:00:00 10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm`ibm;qty:100 200 300 400)
t:([]time:10:01:00 10:01:00 10:01:02;sym:`ibm`ibm`ibm;px:10 20 25)
aj[`time;q;t]
returns
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 200 | 20 |
| 10:01:00 | ibm | 300 | 20 |
| 10:01:02 | ibm | 400 | 25 |
+------------+-----+-----+----+
But I'm trying to get a result like:
+------------+-----+-----+----+
| time | sym | qty | px |
+------------+-----+-----+----+
| 10:00:00 | ibm | 100 | |
| 10:01:00 | ibm | 100 | 10 |
| 10:01:00 | ibm | 100 | 20 |
| 10:01:02 | ibm | 300 | 25 |
+------------+-----+-----+----+
Is there a join function that can match on timestamps strictly less than the trade's time, instead of up to and including it?

I think if you do some variation of aj[`time;q;t] then you won't be able to modify the qty column as table t does not contain it. Instead you may need to use the more "traditional" aj[`time;t;q]:
q)@[;`time;+;00:00:01]aj[`time;@[t;`time;-;00:00:01];q]
time sym px qty
-------------------
10:01:00 ibm 10 100
10:01:00 ibm 20 100
10:01:02 ibm 25 300
This shifts the times to avoid matching where they are equal, but it does not contain a row for each quote you had in the beginning.
If you wish to join trades to quotes, rather than quotes to trades as I have done, you may need some method of differentiating between two trades that occur at the same time, as in your example. One option is to use the order they arrive in, i.e. match the first quote to the first trade.
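That arrival-order idea can be sketched in q (a hedged sketch: seq is a hypothetical column numbering rows within each sym/time group, which is then placed among the aj key columns ahead of the asof column):
q)q2:update seq:til count i by sym,time from q
q)t2:update seq:til count i by sym,time from t
q)aj[`sym`seq`time;q2;t2]
With the example data this pairs the first quote at 10:01:00 with the first trade at that time (px 10) and the second quote with the second trade (px 20).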

One “hacking” way I’m thinking of is to just shift all trades by the minimum time unit, do the aj, and then shift back
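That shift idea can be sketched like this (a hedged sketch: with the second-granularity times in the example, one second is the minimum unit; for time or timestamp columns it would be a millisecond or nanosecond). Shifting the trade times forward by one unit makes equal timestamps fail aj's up-to-and-including match, and since aj keeps the left table's time column no shift back is needed:
q)aj[`sym`time;q;update time:time+00:00:01 from t]
Here the 10:00:00 and 10:01:00 quotes get a null px, and the 10:01:02 quote gets the last trade strictly before it.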

Increment date within while loop using postgresql on Redshift table

MY SITUATION:
I have written a piece of code that aggregates a web user's activity for the previous 90 days and, after some calculation, returns a score (essentially an RFV-style metric).
A (VERY) simplified version of the code can be seen below:
WITH start_data AS (
SELECT user_id
,COUNT(web_visits) AS count_web_visits
,COUNT(button_clicks) AS count_button_clicks
,COUNT(login) AS count_log_in
,SUM(time_on_site) AS total_time_on_site
,CURRENT_DATE AS run_date
FROM web.table
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
AND some_flag = 1
AND some_other_flag = 2
GROUP BY user_id
ORDER BY user_id DESC
)
The output might look something like the below:
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
This output is then stored in its own AWS/Redshift table and will form the base table for the task.
SELECT *
into myschema.base_table
FROM start_data
DESIRED OUTPUT:
What I need to be able to do is iteratively run this code so that I append new data to myschema.base_table, every day, for the previous 90 days' aggregation.
The way I see it, I can either go forwards or backwards, it doesn't matter.
That is to say, I can either:
Starting from today, run the code, everyday, for the preceding 90 days, going BACK to the (first date in the table + 90 days)
OR
Starting from the (first date in the table + 90 days), run the code for the preceding 90 days, everyday, going FORWARD to today.
Option 2 seems the best option to me and the desired output looks like this (PARTITION FOR ILLUSTRATION ONLY):
| user_id | count_web_visits | count_button_clicks | count_log_in | total_time_on_site | run_date |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 412 | 339 |180 | 3600 | 20-01-20 |
| 2391823 | 417 | 6253 |863 | 2400 | 20-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 20-01-20 |
| 5561296 | 281 | 679 |262 | 4200 | 20-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 331 | 204 |83 | 3200 | 21-01-20 |
| 2391823 | 652 | 1222 |409 | 7200 | 21-01-20 |
| 3729128 | 71 | 248 |71 | 720 | 21-01-20 |
| 5561296 | 366 | 722 |519 | 3600 | 21-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 213 | 808 |57 | 3600 | 22-01-20 |
| 2391823 | 817 | 4265 |476 | 1200 | 22-01-20 |
| 3729128 | 33 | 128 |62 | 120 | 22-01-20 |
| 5561296 | 623 | 411 |283 | 2400 | 22-01-20 |
|---------|------------------|---------------------|--------------|--------------------|----------|
| 1234567 | 256 | 932 |16 | 1200 | 23-01-20 |
| 2391823 | 710 | 1345 |308 | 6000 | 23-01-20 |
| 3729128 | 67 | 204 |83 | 320 | 23-01-20 |
| 5561296 | 437 | 339 |172 | 3600 | 23-01-20 |
WHAT I HAVE TRIED:
I have successfully created a WHILE loop to sequentially increment the date as follows:
CREATE OR REPLACE PROCEDURE retrospective_data()
LANGUAGE plpgsql
AS $$
DECLARE
start_date DATE := '2020-11-20' ;
BEGIN
WHILE CURRENT_DATE > start_date
LOOP
RAISE INFO 'Date: %', start_date;
start_date = start_date + 1;
END LOOP;
RAISE INFO 'Loop Statement Executed Successfully';
END;
$$;
CALL retrospective_data();
Thus producing the dates as follows:
INFO: Date: 2020-11-20
INFO: Date: 2020-11-21
INFO: Date: 2020-11-22
INFO: Date: 2020-11-23
INFO: Date: 2020-11-24
INFO: Date: 2020-11-25
INFO: Date: 2020-11-26
INFO: Loop Statement Executed Successfully
Query 1 OK: CALL
WHAT I NEED HELP WITH:
I need to be able to apply the WHILE loop to the initial code such that the WHERE clause becomes:
WHERE TO_CHAR(visit_date, 'YYYY-MM-DD') BETWEEN DATEADD(DAY, -90, start_date) AND start_date
But where start_date is the result of each incremental loop. Additionally, the result of each execution needs to be appended to the previous.
Any help appreciated.
It is fairly clear that you come from a procedural programming background, so my first recommendation is to stop thinking in terms of loops. Databases are giant and powerful data-filtering machines, and thinking in terms of 'do step 1, then step 2' often leads to missing out on all this power.
You want to look into window functions which allow you to look over ranges of other rows for each row you are evaluating. This is exactly what you are trying to do.
Also you shouldn't cast a date to a string just to compare it to other dates (WHERE clause). This is just extra casting and defeats Redshift's table scan optimizations. Redshift uses block metadata that optimizes what data is needed to be read from disk but this cannot work if the column is being cast to another data type.
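For example, the same filter with the date column compared directly (a sketch based on the question's query; no cast, so block metadata pruning can apply):
SELECT user_id, COUNT(web_visits) AS count_web_visits
FROM web.table
WHERE visit_date BETWEEN DATEADD(DAY, -90, CURRENT_DATE) AND CURRENT_DATE
GROUP BY user_id;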
Now to your code (an off-the-cuff rewrite, and for just the first column). Be aware that GROUP BY clauses run BEFORE window functions, and I'm assuming that not all users have a visit every day. Since Redshift doesn't support RANGE in window functions, we will need to make sure all dates are represented for all user_ids. This is done by UNIONing with a sufficient number of rows to cover the date range. You may have a table like this, or may want to create one, but I'll just generate something on the fly to show the process (which assumes that there are fewer dates in the range than rows in the table; likely, but not ironclad).
SELECT user_id
,COUNT(web_visits) AS count_web_visits_by_day
,SUM(COUNT(web_visits)) OVER (partition by user_id order by visit_date rows between 90 preceding and current row) AS count_web_visits_90_day
...
,visit_date
FROM (
SELECT visit_date, user_id, web_visits, ...
FROM web.table
WHERE some_flag = 1 AND some_other_flag = 2
UNION ALL -- this is where I want to union with a full set of dates by user_id
SELECT d.visit_date, u.user_id, NULL AS web_visits, ...
FROM (
SELECT DISTINCT user_id FROM web.table
) u
CROSS JOIN (
SELECT CURRENT_DATE + 1 - row_number() over (order by visit_date) AS visit_date
FROM web.table
) d
) combined
GROUP BY visit_date, user_id
ORDER BY visit_date ASC, user_id DESC ;
The idea here is to set up your data to ensure that you have at least one row for each user_id for each date. Then the window functions can operate on the grouped-by-date-and-user_id information to sum and count over the past 90 rows (which is the same as the past 90 days). You now have all the information you want for all dates, each looking back over 90 days. One query gives you all the information: no while loop, no stored procedures.
Untested but should give you the pattern. You may want to massage the output to give you the range you are looking for and clean up NULL result rows.

How to get non-aggregated measures?

I calculate my metrics with SQL and publish the resulting table to Tableau Server. Afterward, I use this data source to create charts and dashboards.
For one analysis, I already calculated the measures per day with SQL. When I use the resulting table in Tableau, it aggregates these measures with SUM by default. However, I don't want a SUM or AVG of the averages, or a SUM of the percentiles.
What I want is the result I would get if I didn't select the date dimension and didn't GROUP BY date in SQL, as shown below.
Here is the query:
SELECT
-- date,
COUNT(DISTINCT id) AS count_of_id,
AVG(timediff_in_sec) AS avg_timediff,
PERCENTILE_CONT(0.25) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_25,
PERCENTILE_CONT(0.50) WITHIN GROUP(ORDER BY timediff_in_sec) AS percentile_50
FROM
(
--subquery
) AS t1
-- GROUP BY date
Here are the first rows of the resulting table:
+------------+--------------+-------------+---------------+---------------+
| date | avg_timediff | count_of_id | percentile_25 | percentile_50 |
+------------+--------------+-------------+---------------+---------------+
| 10/06/2020 | 61,65186364 | 22 | 8,5765 | 13,3015 |
| 11/06/2020 | 127,2913333 | 3 | 15,6045 | 17,494 |
| 12/06/2020 | 306,0348214 | 28 | 12,2565 | 17,629 |
| 13/06/2020 | 13,2664 | 5 | 11,944 | 13,862 |
| 14/06/2020 | 16,728 | 7 | 14,021 | 17,187 |
| 15/06/2020 | 398,6424595 | 37 | 11,893 | 19,271 |
| 16/06/2020 | 293,6925152 | 33 | 12,527 | 17,134 |
| 17/06/2020 | 155,6554286 | 21 | 13,452 | 16,715 |
| 18/06/2020 | 383,8101429 | 7 | 266,048 | 493,722 |
+------------+--------------+-------------+---------------+---------------+
How can I achieve the desired output above?
Drag them all into the dimensions list, then they will be static dimensions. For your use you could also just drag the Date field to Rows. Aggregating 1 value, which you have for each date, returns the same value whatever the aggregation type.

postgres lag when data is missing

I have data on baseball players' annual salaries, with some years missing. What I would like to do is calculate the min, max, and average change in salary from the prior year for all players in a year.
For example data looks like below from the table 'salaries':
| playerid | yearid | salary |
| a | 2016 | 10000 |
| b | 2016 | 5000 |
| a | 2015 | 9000 |
| b | 2015 | 3000 |
| a | 2014 | 3000 |
| b | 2014 | 15000 |
| a | 2010 | 1000 |
As you can see, player A has a yearly change of 1k and 6k. player B has a yearly change of 2k and -12k. So I would like a select statement that brings out:
| yearid | min change | max change | avg change |
| 2016 | 1k | 2k | 1.5k |
| 2015 | -12k | 6k | -3k |
Is there a way to do this?
My lag function has unfortunately captured the difference between 2014 and 2010 for playerid a, which is obviously wrong. I couldn't figure out how to apply the lag function only when the previous row's yearid was 1 less than the current row's yearid.
Any suggestions would be greatly appreciated.
Just use the previous year for the filtering:
select yearid, min(salary - prev_salary), max(salary - prev_salary),
avg(salary - prev_salary)
from (select s.*,
lag(s.salary) over (partition by s.playerid order by yearid) as prev_salary,
lag(s.yearid) over (partition by s.playerid order by yearid) as prev_yearid
from salaries s
) s
where prev_yearid = yearid - 1
group by yearid;
Or, you can just use a join:
select s.yearid, . . .
from salaries s join
salaries sp
on sp.playerid = s.playerid and sp.yearid = s.yearid - 1
group by s.yearid;
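Filled out with the same aggregates (a sketch; the change columns mirror the lag query, since the question elides the select list):
select s.yearid, min(s.salary - sp.salary) as min_change,
max(s.salary - sp.salary) as max_change,
avg(s.salary - sp.salary) as avg_change
from salaries s join
salaries sp
on sp.playerid = s.playerid and sp.yearid = s.yearid - 1
group by s.yearid;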

SQL Database Design - Combinations of factors affecting a final value

The price of a service may depend on various combinations of factors. Example Tables:
ServiceType_Duration (combination of 2 factors)
+-------------+----------+--------+
| ServiceType | Duration | Amount |
+-------------+----------+--------+
| Massage | 30 | 30 |
| Massage | 60 | 50 |
| Reflexology | 30 | 50 |
| Reflexology | 60 | 70 |
+-------------+----------+--------+
DiscountCode (1 factor)
+--------------+--------+---------+
| DiscountCode | Amount | Percent |
+--------------+--------+---------+
| D1 | -10 | NULL |
| D2 | NULL | -10% |
+--------------+--------+---------+
In this very simple example, a 30 minute massage with discount code D1 would have total price 30 - 10 = 20.
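A sketch of how that calculation could be written against the two example tables (assuming Percent is stored as a plain number such as -10, and with literal values standing in for application parameters):
SELECT sd.Amount
       + COALESCE(dc.Amount, 0)                      -- flat discount, e.g. -10
       + sd.Amount * COALESCE(dc.Percent, 0) / 100.0 -- percentage discount
       AS TotalPrice
FROM ServiceType_Duration sd
LEFT JOIN DiscountCode dc ON dc.DiscountCode = 'D1'
WHERE sd.ServiceType = 'Massage' AND sd.Duration = 30;
With the sample data this gives 30 + (-10) = 20.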
However, there may be many more such tables of the general format:
Factor1
Factor2
...
FactorN
Amount (possibly)
Percent (possibly)
I'm not sure about having lots of 'pricing tables' when the general format is always going to be the same. It could mean having to add/remove the same column from lots of tables.
Are the tables above OK? If not, what's the best practice for storing this sort of data?

Tableau: calculated field after data is reshaped

I'm trying to wrap my head around how to create a calculated field in Tableau that is computed after the source data is pivoted. My source data is "long", i.e. normalized, and looks like this:
+---------+---------+-------+
| Company | Measure | Value |
+---------+---------+-------+
| A | Sales | 100 |
+---------+---------+-------+
| A | Exp | -10 |
+---------+---------+-------+
| B | Sales | 200 |
+---------+---------+-------+
| B | Exp | -30 |
+---------+---------+-------+
(Actually every company would have more than two records, but this is simplified)
What I'd like to get out is the following where Net is calculated as Sales + (2 * Exp).
+---------+---------+-------+-------+
| Company | Sales | Exp | Net |
+---------+---------+-------+-------+
| A | 100 | -10 | 80 |
+---------+---------+-------+-------+
| B | 200 | -30 | 140 |
+---------+---------+-------+-------+
I can get the following by simply having Company as my row and Measure as my column and then sum(Value):
+---------+---------+-------+
| Company | Sales | Exp |
+---------+---------+-------+
| A | 100 | -10 |
+---------+---------+-------+
| B | 200 | -30 |
+---------+---------+-------+
But how do I calculate an additional column based on the result of pivoting Measure?
Does this get you what you need?
The crux is creating calculated fields for Exp and Sales like this:
Exp =
if [Measure] = "Exp" then [Value] end
Sales =
if [Measure] = "Sales" then [Value] end
Those become measures you can use as the columns.
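Building on those, the Net from the question (Sales + (2 * Exp)) can then be a third calculated field, e.g.:
Net =
sum([Sales]) + 2 * sum([Exp])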