kdb/q building NBBO from TAQ data

I have a table with bid/ask for every stock/venue. Something like:
taq:`time xasc ([] time:10:00:00+(100?1000);bid:30+(100?20)%30;ask:30.8+(100?20)%30;stock:100?`STOCK1`STOCK2;exchange:100?`NYSE`NASDAQ)
How can I get the best bid/offer across all exchanges as of a time (in one-minute buckets) for every stock?
My initial thought is to build a table that has a row for every minute/exchange/stock and do an asof join on the taq data. However, that sounds like a brute-force solution to me. Since this is a solved problem, I figured I'd ask if there is a better way.

select max bid, min ask by stock,1+minute from 0!select by 1 xbar time.minute,stock,exchange from taq
This will give you the max bid, min ask across exchanges at the 1-minute interval in column minute.
The only tricky thing is the select by 1 xbar time.minute. When you do select by with no aggregation, it will just return the last row. So really what this means is select last time, last bid, last ask .... by 1 xbar time.minute etc.
So after we get the last values by minute and exchange, we just get the min/max across exchanges for that minute.
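For readers who don't use q, the same two-step logic can be sketched in Python. This is only an illustration under assumptions: quotes are plain tuples already sorted by time, and the helper name nbbo_by_minute is hypothetical, not anything from kdb+.

```python
from datetime import time

def nbbo_by_minute(quotes):
    """quotes: iterable of (time, stock, exchange, bid, ask), sorted by time."""
    # Step 1: keep only the last quote per (minute, stock, exchange).
    last = {}
    for t, stock, exch, bid, ask in quotes:
        minute = t.hour * 60 + t.minute
        last[(minute, stock, exch)] = (bid, ask)  # later rows overwrite earlier ones

    # Step 2: across exchanges, best bid is the max and best ask is the min.
    best = {}
    for (minute, stock, _), (bid, ask) in last.items():
        key = (stock, minute + 1)  # label the bucket by its closing minute, like 1+minute
        if key in best:
            b, a = best[key]
            best[key] = (max(b, bid), min(a, ask))
        else:
            best[key] = (bid, ask)
    return best

quotes = [
    (time(10, 0, 5), "STOCK1", "NYSE", 30.10, 30.90),
    (time(10, 0, 40), "STOCK1", "NYSE", 30.20, 31.00),
    (time(10, 0, 50), "STOCK1", "NASDAQ", 30.15, 30.95),
]
result = nbbo_by_minute(quotes)
```

Like the q query, this only looks at exchanges that actually quoted inside the minute; it does not carry an older quote forward into later buckets.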

Related

Return at most 100 equally spaced rows from query

I have a table with time series data that looks like this:
id                                    pressure  date                 location
==============================================================================
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e  1.54      1999-01-08 04:05:06  "London"
c47c89f1-bdf8-420c-9237-32d70f9119f6  1.67      1999-01-08 04:05:06  "Paris"
54bbd56b-3269-4417-8422-3f1b7285e165  1.77      1999-01-08 04:05:06  "Berlin"
...                                   ...       ...                  ...
Now I would like to create a query for a given location between two dates that returns at most 100 results. If there are 100 rows or fewer, simply return all of them. If there are more than 100, I'd like them to be equally spaced by date.
(I.e. if I select from 1 January 1970 to 1 January 2020, and say that covers 18,993 days, I'd like it to return only every 189th day so that it returns exactly 100 records, cutting off the remaining 93 days.)
So it returns rows like this:
id                                    pressure  date                 location
==============================================================================
873d8bc5-46fe-4d92-bb3b-4efeef3f1e8e  1.54      1970-01-01 04:05:06  London
8dc7c77b-6958-4cc7-914a-4b9c1f661200  1.1       1970-07-09 04:05:06  London
4e3d4c3b-a7e3-48bf-b6a3-5327cc606c82  1.23      1971-01-14 04:05:06  London
...                                   ...       ...                  ...
Here's how far I got:
SELECT
pressure,
date
FROM
location
WHERE
location=$1
AND date>$2 AND date<$3
ORDER BY
date
The problem is, I see no way of achieving this without first loading all the rows, counting them and then sampling them, which would be bad for performance (I'd have to load 18,993 rows to return 100). So is there a way to efficiently load only those hundred rows?

I don't have a complete answer, but some ideas to start one:
SELECT MAX(date), MIN(date) FROM location gets you the first and last dates
something like SELECT MIN(date)::timestamp as beginning, (MAX(date)::timestamp - MIN(date)::timestamp)/COUNT(*) as avgdistance FROM location (no guarantee on syntax, can't check on a live DB right now) would get you the start point and the average distance between points.
The method to get every nth row is SELECT * FROM (SELECT *, row_number() over() rn FROM tab) foo WHERE foo.rn % {n} = 0; (with {n} replaced by your desired number).
You can probably replace that WHERE with something else than a row_number() check and that will get you somewhere.
Feel free to delete this answer if a real answer comes along, until then maybe this'll get you started.
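To make the every-nth-row arithmetic concrete, here is a small Python sketch. It is only an illustration of the step calculation, under the assumption that the rows are already in date order; the helper name sample_evenly is made up. In the database the same filter would run server-side (for example through the row_number() check above), so this is not a substitute for doing it in SQL.

```python
def sample_evenly(rows, limit=100):
    """Return at most `limit` rows, evenly spaced over the input order."""
    n = len(rows)
    if n <= limit:
        return list(rows)       # few enough rows: return them all
    step = n // limit           # e.g. 18993 // 100 == 189
    return rows[::step][:limit] # every step-th row, leftover tail cut off

days = list(range(18993))       # stand-in for 18,993 date-ordered rows
picked = sample_evenly(days)
```

Note that this starts at the first row (index 0), whereas rn % n = 0 with a 1-based row_number() starts at the n-th row; either convention gives equally spaced rows.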

I need to add up lots of values between date ranges as quickly as possible using PostgreSQL, what's the best method?

Here's a simple example of what I'm trying to do:
CREATE TABLE daily_factors (
factor_date date,
factor_value numeric(3,1));
CREATE TABLE customer_date_ranges (
customer_id int,
date_from date,
date_to date);
INSERT INTO
daily_factors
SELECT
t.factor_date,
(random() * 10 + 30)::numeric(3,1)
FROM
generate_series(timestamp '20170101', timestamp '20210211', interval '1 day') AS t(factor_date);
WITH customer_id AS (
SELECT generate_series(1, 100000) AS customer_id),
date_from AS (
SELECT
customer_id,
(timestamp '20170101' + random() * (timestamp '20201231' - timestamp '20170101'))::date AS date_from
FROM
customer_id)
INSERT INTO
customer_date_ranges
SELECT
d.customer_id,
d.date_from,
(d.date_from::timestamp + random() * (timestamp '20210211' - d.date_from::timestamp))::date AS date_to
FROM
date_from d;
So I'm basically making two tables:
a list of daily factors, one for every day from 1st Jan 2017 until today's date;
a list of 100,000 "customers" all who have a date range between 1st Jan 2017 and today, some long, some short, basically random.
Then I want to add up the factors for each customer in their date range, and take the average value.
SELECT
cd.customer_id,
AVG(df.factor_value) AS average_value
FROM
customer_date_ranges cd
INNER JOIN daily_factors df ON df.factor_date BETWEEN cd.date_from AND cd.date_to
GROUP BY
cd.customer_id;
Having a non-equi join on a date range is never going to be pretty, but is there any way to speed this up?
The only index I could think of was this one:
CREATE INDEX performance_idx ON daily_factors (factor_date);
It makes a tiny difference to the execution time. When I run this locally I'm seeing around 32 seconds with no index, and around 28s with the index.
I can see that this is a massive bottleneck in the system I'm building, but I can't think of any way to make things faster. The ideas I did have were:
instead of using daily factors I could largely get away with monthly ones, but now I have the added complexity of "whole months and partial months" to work with. It doesn't seem like it's going to be worth it for the added complexity, e.g. "take 7 whole months for Feb to Aug 2020, then 10/31 of Jan 2020 and 15/30 of September 2020";
I could pre-calculate every average I will ever need, but with 1,503 factors (and that will increase with each new day), that's already 1,128,753 numbers to store (assuming we ignore zero date ranges and that my maths is right). Also my real world system has an extra level of complexity, a second identifier with 20 possible values, so this would mean having c.20 million numbers to pre-calculate. Also, the number of values to store grows quadratically as days are added;
I could take this work out of the database, and do it in code (in memory), as it seems like a relational database might not be the best solution here?
Any other suggestions?

The classic way to deal with this is to store running sums of factor_value, not (or in addition to) individual values. Then you just look up the running sum at the two end points (actually at the end, and one before the start), and take the difference. And of course divide by the count, to turn it into an average. I've never done this inside a database, but there is no reason it can't be done there.
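A minimal sketch of the running-sum idea, in Python rather than inside the database. The function names and the 0-based day indexes are assumptions made for the illustration; in practice the index would be derived from factor_date.

```python
from itertools import accumulate

def build_running_sums(values):
    # running[i] holds the sum of the first i values, so running[0] == 0.
    return [0] + list(accumulate(values))

def range_average(running, start, end):
    """Average of values[start..end] inclusive, using two lookups only."""
    total = running[end + 1] - running[start]  # sum at the end minus sum before the start
    return total / (end - start + 1)

factors = [30.0, 32.0, 34.0, 36.0, 38.0]  # one factor per day
running = build_running_sums(factors)
avg = range_average(running, 1, 3)        # days 1..3 -> (32 + 34 + 36) / 3
```

Each customer then costs two lookups and a subtraction, regardless of how long their date range is.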

Count Until A Specific Value?

Say you've got a table ordered by the date that captures the speed of vehicles with a device in them. And imagine you get 30 updates per day for the speed. It's not always 30 per vehicle. The data will have the vehicle, the timestamp, and the speed.
What I want to do is be able to count how many days have passed since the vehicle last went over 10 mph in order to find inactive vehicles. Is something like that possible in postgresql?
*Or is there a way to get back the row number of the table if it's sorted where the speed goes past 10, and then select the date in that row number to subtract the current date from the date listed?

SELECT DISTINCT ON (vessel) vessel, now() - date
FROM your_table
WHERE speed > 10
ORDER BY vessel, date DESC
This will tell you, for every vehicle, how long ago its speed field was last over 10.
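For comparison, the same "latest qualifying row per vehicle" logic can be sketched in Python. The function name, the tuple layout, and the sample data are all assumptions made for the illustration.

```python
from datetime import datetime

def time_since_last_fast(readings, now, threshold=10):
    """readings: iterable of (vehicle, timestamp, speed)."""
    last_fast = {}
    for vehicle, ts, speed in readings:
        # Keep the most recent timestamp at which the speed exceeded the threshold.
        if speed > threshold and (vehicle not in last_fast or ts > last_fast[vehicle]):
            last_fast[vehicle] = ts
    return {vehicle: now - ts for vehicle, ts in last_fast.items()}

readings = [
    ("A", datetime(2021, 2, 1, 8), 25),
    ("A", datetime(2021, 2, 5, 8), 4),
    ("B", datetime(2021, 2, 9, 8), 12),
    ("C", datetime(2021, 2, 9, 9), 3),
]
now = datetime(2021, 2, 11, 8)
idle = time_since_last_fast(readings, now)
```

As with the WHERE speed > 10 filter in the query, a vehicle that never exceeded the threshold (C here) does not appear in the result at all, so it would need separate handling if it should count as inactive.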

SELECT vessel, now() - max(date)
FROM your_table
WHERE speed > 10
GROUP BY vessel;

Web analytics schema with postgres

I am building a web analytics tool and use PostgreSQL as the database. I will not insert a row into Postgres for each user visit, only aggregated data every 5 seconds:
time country browser num_visits
========================================
0 USA Chrome 12
0 USA IE 7
5 France IE 5
As you can see, every 5 seconds I insert multiple rows (one per combination of dimensions).
In order to reduce the number of rows that need to be scanned in queries, I am thinking of having multiple tables with the above schema, one per resolution: 5SecondResolution, 30SecondResolution, 5MinResolution, ..., 1HourResolution. Now when the user asks about the last day I will go to the hour resolution table, which is smaller than the 5 second resolution table (although I could have used that one too, it's just more rows to scan).
Now what if the hour resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, etc. AFAIU I have traded one query against a huge table (that has many relevant rows to scan) for multiple queries against medium tables plus combining the results on the client side.
Does this sound like a good optimization?
Any other considerations on this?

Now what if the hour resolution table has data on hours 0,1,2,3,... but the user asks to see an hourly trend from 1:59 to 8:59? In order to get data for the 1:59-2:59 period I could do multiple queries against the different resolution tables, so I get 1:59-2:00 from 1MinResolution, 2:00-2:30 from 30MinResolution, etc.
You can't do that if you want your results to be accurate. Imagine if they're asking for one hour resolution from 01:30 to 04:30. You're imagining that you'd get the first and last half hour from the 5 second (or 1 minute) res table, then the rest from the one hour table.
The problem is that the one-hour table is offset by half an hour, so the answers won't actually be correct; each hour will be from 2:00 to 3:00, etc, when the user wants 2:30 to 3:30. It's an even more serious problem as you move to coarser resolutions.
So: This is a perfectly reasonable optimisation technique, but only if you limit your users' search start precision to the resolution of the aggregated table. If they want one hour resolution, force them to pick 1:00, 2:00, etc and disallow setting minutes. If they want 5 min resolution, make them pick 1:00, 1:05, 1:10, ... and so on. You don't have to limit the end precision the same way, since an incomplete ending interval won't affect data prior to the end and can easily be marked as incomplete when displayed. "Current day to date", "Hour so far", etc.
If you limit the start precision you not only give them correct results but greatly simplify the query. If you limit the end precision too then your query is purely against the aggregated table, but if you want "to date" data it's easy enough to write something like:
SELECT blah, mytimestamp
FROM mydata_1hour
WHERE mytimestamp BETWEEN current_date + INTERVAL '1' HOUR AND current_date + INTERVAL '4' HOUR
UNION ALL
SELECT sum(blah), current_date + INTERVAL '5' HOUR
FROM mydata_5second
WHERE mytimestamp BETWEEN current_date + INTERVAL '4' HOUR AND current_date + INTERVAL '5' HOUR;
... or even use several levels of union to satisfy requests for coarser resolutions.
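The shape of that UNION ALL, complete hours from the coarse table plus one "hour so far" bucket summed from the fine table, can be sketched in Python as follows. The function name, the dict/list layout, and the sample numbers are assumptions made for the illustration, and the sketch uses half-open intervals where the SQL uses BETWEEN.

```python
def hourly_trend(hourly, fine, start_hour, end_second):
    """hourly: {hour_index: visits}; fine: list of (second_of_day, visits)."""
    current_hour = end_second // 3600  # the hour that is still incomplete
    # Complete hours come straight from the pre-aggregated table.
    trend = [(h, hourly[h]) for h in range(start_hour, current_hour)]
    # The incomplete hour is summed from the fine-grained rows.
    tail = sum(v for s, v in fine if current_hour * 3600 <= s < end_second)
    trend.append((current_hour, tail))  # "hour so far" bucket
    return trend

hourly = {1: 100, 2: 120, 3: 90}
fine = [(4 * 3600 + 5, 7), (4 * 3600 + 10, 3), (4 * 3600 + 1800, 5)]
trend = hourly_trend(hourly, fine, start_hour=1, end_second=4 * 3600 + 600)
```

This works only because the start is aligned to an hour boundary, which is the restriction argued for above.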

You could use inheritance/partitioning: one master table and many hourly-resolution child tables (and, perhaps, many minute- and second-resolution child tables).
That way you only have to select from the master table, and the constraints on each child table decide which ones are scanned.
Of course, you have to add a trigger function to route each insert into the appropriate child table.
It's a trade-off: complexity on insert versus complexity on display.
PostgreSQL - View or Partitioning?

Daily counts with TSQL?

I have a site where I record client metrics in a SQL Server 2008 db on every link clicked. I have already written the query to get the daily total clicks, however I want to find out how many times the user clicked within a given timespan (ie. within 5 seconds).
The idea here is to lock out incoming IP addresses that are trying to scrape content. The assumption is that if more than 5 "clicks" are detected within 5 seconds, or the number of daily clicks from a given IP address exceeds some value, this is a scraping attempt.
I have tried a few variations of the following:
-- when a user clicked more than 5 times in 5 seconds
SELECT DATEADD(SECOND, DATEDIFF(SECOND, 0, ClickTimeStamp), 0) as ClickTimeStamp, COUNT(UserClickID) as [Count]
FROM UserClicks
WHERE DATEDIFF(SECOND, 0, ClickTimeStamp) = 5
GROUP BY IPAddress, ClickTimeStamp
This one in particular returns the following error:
Msg 535, Level 16, State 0, Line 3 The datediff function resulted in
an overflow. The number of dateparts separating two date/time
instances is too large. Try to use datediff with a less precise
datepart.
So once again, I want to use the seconds datepart, which I believe I'm on the right track, but not quite getting it.
Help appreciated. Thanks.
-- UPDATE --
Great suggestions and helped me think that the approach is wrong. The check is going to be made on every click. What I should do is for a given timestamp, check to see if in the last 5 seconds 5 clicks have been recorded from the same IP address. So it would be something like, count the number of clicks for > GetDate() - 5 seconds
Trying the following still isn't giving me an accurate figure.
SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())

Hoping my syntax is good; I only have Oracle to test this on. I'm going to assume you have an ID column called user_id that is unique to that user (is it user_click_id? It's helpful to include table create statements in these questions when you can).
You'll have to perform a self join on this one. The logic is: take each row of userclicks and join it onto userclicks where the user_id matches and the difference in clicktimestamp is between 0 and 5 seconds. Then count from the subselect.
select u1.user_id, u1.clicktimestamp, u2.clicktimestamp
from userclicks u1
left join userclicks u2
on u2.user_id = u1.user_id
and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0
This select statement should give you the user_id/clicktimestamp and one row for every record from the same user that falls between 0 and 5 seconds after that clicktimestamp. Now it's just a matter of counting all user_id, clicktimestamp combinations and highlighting the ones with 5 or more. Take the above query, turn it into a subselect, and pull counts from it:
select a.user_id, a.clicktimestamp, count(1)
from
(select u1.user_id, u1.clicktimestamp
from userclicks u1
left join userclicks u2
on u2.user_id = u1.user_id
and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) <= 5
and datediff(second,u1.ClickTimeStamp,u2.ClickTimeStamp) > 0) a
group by a.user_id, a.clicktimestamp
having count(1) >= 5
Wish I could verify my syntax on an MS machine... there might be some typos in there, but the logic should be good.

An answer for your UPDATE: the problem is in the third line of
SELECT COUNT(*)
FROM UserClicks
WHERE ClickTimeStamp >= GetDate() - DATEADD(SECOND, -5, GetDate())
GetDate() - DATEADD(SECOND, -5, GetDate()) is saying "take the current date time and subtract (the current date time minus five seconds)". I'm not entirely sure what kind of value this produces, but it won't be the one you want.
You still want some kind of time period, perhaps like so:
SELECT count(*)
from UserClicks
where IPAddress = @IPAddress
and ClickTimeStamp between dateadd(second, -5, getdate()) and getdate()
I'm a bit uncomfortable using getdate() there--if you have a specific datetime value (accurate to the second), you should probably use it.
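The check itself is just "count this IP's clicks between five seconds ago and now". A small Python sketch of that window, using a fixed reference time instead of repeated clock reads (the function name and sample data are assumptions for the illustration):

```python
from datetime import datetime, timedelta

def clicks_in_window(clicks, ip, now, seconds=5):
    """clicks: iterable of (ip_address, timestamp). Counts clicks in [now - seconds, now]."""
    start = now - timedelta(seconds=seconds)  # lower bound first, then upper bound
    return sum(1 for click_ip, ts in clicks if click_ip == ip and start <= ts <= now)

now = datetime(2021, 2, 11, 12, 0, 10)
clicks = [("1.2.3.4", now - timedelta(seconds=s)) for s in (0, 1, 2, 3, 4, 9)]
clicks.append(("5.6.7.8", now))
recent = clicks_in_window(clicks, "1.2.3.4", now)
```

Passing in a single now value keeps both ends of the window consistent, which is the same reason for preferring a specific datetime over calling getdate() twice.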

Assuming log entries are only entered for current activity -- that is, whenever a new row is inserted, the logged time is for that point in time and never for any prior point in time -- then you should only need to review data for a set period of time, and not have to review "all data" as you are doing now.
Next question is: how frequently do you make this check? If you are concerned with clicks per second, then something between "once per hour" and "once every 24 hours" seems reasonable.
Next up: define your interval. "All clicks per IPAddress within 5 seconds" could go two ways: a set (fixed) window (00-04, 05-09, 10-14, etc.) or a sliding window (00-04, 01-05, 02-06, etc.). Probably irrelevant with a 5 second window, but perhaps more relevant for longer periods (clicks per "day").
With that, the general approach I'd take is:
Start with earliest point in time you care about (1 hour ago, 24 hours ago)
Set up "buckets", means by which time windows can be identified (00:00:00 - 00:00:04, 00:00:05 - 00:00:09, etc.). This could be done as a temp table.
For all events, calculate number of elapsed seconds since your earliest point
For each bucket, count number of events that hit that bucket, grouped by IPAddress (inner join on the temp table on seconds between lowValue and highValue)
Identify those that exceed your threshold (having count(*) > X), and defenestrate them.
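The steps above can be sketched in Python with fixed buckets. This is a rough illustration rather than the T-SQL temp-table version: the function name, the tuple layout, and the default threshold are assumptions, and integer division of the elapsed seconds stands in for the join against a bucket table.

```python
from collections import Counter
from datetime import datetime, timedelta

def flag_scrapers(events, earliest, bucket_seconds=5, threshold=5):
    """events: iterable of (ip_address, timestamp). Returns IPs over the threshold."""
    counts = Counter()
    for ip, ts in events:
        elapsed = int((ts - earliest).total_seconds())  # seconds since the earliest point
        if elapsed < 0:
            continue                                    # older than the period we care about
        counts[(ip, elapsed // bucket_seconds)] += 1    # bucket 0 is 00-04, bucket 1 is 05-09, ...
    # Equivalent of HAVING count(*) > X, reported once per IP.
    return sorted({ip for (ip, _), c in counts.items() if c > threshold})

earliest = datetime(2021, 2, 11, 12, 0, 0)
events = [("1.2.3.4", earliest + timedelta(seconds=s)) for s in (0, 1, 1, 2, 3, 4)]
events += [("5.6.7.8", earliest + timedelta(seconds=s)) for s in (0, 5, 10)]
flagged = flag_scrapers(events, earliest)
```

With fixed buckets, a burst that straddles a bucket boundary can slip under the threshold; a sliding window avoids that at the cost of more comparisons.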
