Aggregation over aggregation

I have time-series data (stock exchange trades) and I need to aggregate it by time interval: one minute, 5 minutes, 15 minutes, etc.
A senior time frame can be calculated from a minor one, e.g. five 1-minute candles -> one 5-minute candle.
I made a MATERIALIZED VIEW with the AggregatingMergeTree engine, which successfully calculates m1, like
maxState(price) as price_high, countState(item_id) as trades_count
But I don't know how to build the next timeframes. If I use maxMerge in the next view, it returns an incorrect result, which is expected since the docs say I must use -State with AggregatingMergeTree; but when I use -State in m5 too, it fails with an error.
I'd like to build a series of materialized views where each minor view feeds the senior one, in a pipeline driven by updates from trades.
UPDATE (SQL):
CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m1_state
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(toDateTime(timestamp_close_m1/1000))
ORDER BY (platform_id, symbol, timestamp_close_m1)
POPULATE AS
select
platform_id as platform_id,
symbol as symbol,
'1m' as `candle_interval`,
1000*toUnixTimestamp(toStartOfMinute(toDateTime(timestamp/1000))) as timestamp_m1,
1000*toUnixTimestamp(addMinutes(toStartOfMinute(toDateTime(timestamp/1000)), 1)) as timestamp_close_m1,
...
minState(price) as price_low,
countState(item_id) as trades_count
from trade
group by platform_id, symbol, timestamp_m1, timestamp_close_m1, `candle_interval`
order by timestamp_close_m1;
/* The one below is definitely wrong due to the -State suffix */
CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m5_test
ENGINE = AggregatingMergeTree() PARTITION BY toYYYYMM(toDateTime(timestamp_close_m5 / 1000))
ORDER BY (platform_id, symbol, timestamp_close_m5) SETTINGS index_granularity = 8192
POPULATE AS
SELECT platform_id, symbol, '5m' AS candle_interval,
1000 * toUnixTimestamp(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000))) AS timestamp_m5,
1000 * toUnixTimestamp(addMinutes(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000)), 5)) AS timestamp_close_m5,
...
minState(price_low) AS price_low,
countState(trades_count) AS trades_count
FROM candle_m1_state
GROUP BY platform_id, symbol, timestamp_m5, timestamp_close_m5
ORDER BY platform_id ASC, symbol ASC, timestamp_close_m5 ASC;

I wouldn't try to chain views. I'd do one view per aggregation.
Also keep in mind that a MATERIALIZED VIEW is more of an insert trigger than a view.
I'd recommend:
CREATE MATERIALIZED VIEW
stream__source__target_5m TO target_5m
AS
SELECT ...
CREATE MATERIALIZED VIEW
stream__source__target_1m TO target_1m
AS
SELECT ...
Etc.
where target_xm are your target tables.
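For a concrete sketch of that layout (the target table definition and the column types here are assumptions; names follow the question):

CREATE TABLE IF NOT EXISTS target_5m
(
    platform_id UInt32,   -- types are guesses, adjust to your schema
    symbol String,
    timestamp_close_m5 UInt64,
    price_low AggregateFunction(min, Float64),
    trades_count AggregateFunction(count, UInt64)
)
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(toDateTime(timestamp_close_m5 / 1000))
ORDER BY (platform_id, symbol, timestamp_close_m5);

CREATE MATERIALIZED VIEW stream__trade__target_5m TO target_5m AS
SELECT
    platform_id,
    symbol,
    1000 * toUnixTimestamp(addMinutes(toStartOfFiveMinute(toDateTime(timestamp / 1000)), 5)) AS timestamp_close_m5,
    minState(price) AS price_low,
    countState(item_id) AS trades_count
FROM trade
GROUP BY platform_id, symbol, timestamp_close_m5;

Each view reads only from the raw trade table, so no view depends on another.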

Select-query time obviously favors a chain of materialized views, so I'd like to stick with that solution rather than building a view for each Time Frame (TF) aggregation from the original data.
So the solution is:
Original raw data -> TF1 materialized view (AggregatingMergeTree, -State suffix) -> TF2 (from TF1) (AggregatingMergeTree, -MergeState suffix)
Then query any of TF1, TF2, ... with the -Merge suffix.
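A minimal sketch of the TF2 link following that scheme, trimmed to the two aggregates from the question:

CREATE MATERIALIZED VIEW IF NOT EXISTS candle_m5_state
ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(toDateTime(timestamp_close_m5 / 1000))
ORDER BY (platform_id, symbol, timestamp_close_m5)
AS
SELECT
    platform_id,
    symbol,
    '5m' AS candle_interval,
    1000 * toUnixTimestamp(addMinutes(toStartOfFiveMinute(toDateTime(timestamp_m1 / 1000)), 5)) AS timestamp_close_m5,
    minMergeState(price_low) AS price_low,          -- -MergeState combines m1 states into an m5 state
    countMergeState(trades_count) AS trades_count
FROM candle_m1_state
GROUP BY platform_id, symbol, timestamp_close_m5, candle_interval;

-- reading any level finalizes the states with -Merge:
SELECT platform_id, symbol, timestamp_close_m5,
       minMerge(price_low) AS price_low,
       countMerge(trades_count) AS trades_count
FROM candle_m5_state
GROUP BY platform_id, symbol, timestamp_close_m5;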

Related

How do I merge a frequency count back onto a table in postgresql?

I am attempting to count the number of rows in each category and merge the count back onto the table, overwriting the table in PostgreSQL.
This is the main table I have (named Titanic, containing the columns in question):

PassengerId | Group
0001_01     | 1
0002_01     | 2
0003_01     | 3
0003_02     | 3
I've altered the table by adding a new numeric column "GroupSize", which I want to hold the frequency count of each group category. So records 1 and 2 would each get a count of 1, and records 3 and 4 would each get a count of 2. I want my main "Titanic" table to be retained, as opposed to creating a new table or view, so ideally an UPDATE statement would impute the values into "GroupSize".
I have created a view to hold the corresponding frequency counts per group with this code:
CREATE OR REPLACE VIEW "GroupSize"("Group", "GroupSize") AS
select "Group", count("Group") from "Titanic" GROUP BY "Group";
which outputs this:
Group | GroupSize
1     | 1
2     | 1
3     | 2
And I've tried an UPDATE statement that uses this view to fill the "GroupSize" column in "Titanic", like so:
UPDATE "Titanic"
SET "GroupSize" = (SELECT "GroupSize" from "GroupSize")
WHERE "Group" IN (SELECT "Group" from "GroupSize");
I have been unsuccessful in getting this UPDATE statement to work, mainly because I get the error: "more than one row returned by a subquery used as an expression". I am pretty new to SQL, so any help would be appreciated.
You almost had it right. The value used in SET is computed dynamically for each row being modified. All you have to do is add a WHERE clause to it to ensure it picks the right value from the view.
UPDATE "Titanic"
SET "GroupSize" = (
SELECT "GroupSize" from "GroupSize"
where "Titanic"."Group" = "GroupSize"."Group"
-- (Pedantic safety limit, just in case)
limit 1
)
Beware, though: this will modify every row, setting NULL for values not found in the view. To preserve the "GroupSize" column for rows without a match in the view, tack on another WHERE clause:
UPDATE "Titanic"
SET "GroupSize" = (
SELECT "GroupSize" from "GroupSize"
where "Titanic"."Group" = "GroupSize"."Group"
limit 1
)
WHERE "Group" IN (SELECT "Group" from "GroupSize");
Do not actually UPDATE your main table; just create the view to hold the group size. This eliminates maintenance headaches when performing DML on the table: imagine the extra work needed to move a passenger from one group to another. With the count only in the view, you do nothing extra. You get the count of rows in each group with the window version of count() (see demo):
create or replace view titanic_vw as
select passengerid "Passenger Id"
, passenger_group "Group"
, count(*) over (partition by passenger_group) "Group Size"
from titanic;
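With the sample rows from the question, the view then reads back like this (a quick sanity check, assuming the base table uses the passengerid/passenger_group column names from the demo above):

select * from titanic_vw;

Passenger Id | Group | Group Size
0001_01      | 1     | 1
0002_01      | 2     | 1
0003_01      | 3     | 2
0003_02      | 3     | 2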

How to query the first row efficiently?

I have a table with a large number of records:
date instrument price
2019.03.07 X 1.1
2019.03.07 X 1.0
2019.03.07 X 1.2
...
When I query for the day opening price, I use:
1 sublist select from prices where date = 2019.03.07, instrument = `X
It takes a long time to execute because it selects all the prices on that day and then takes the first one.
I also tried:
select from prices where date = 2019.03.07, instrument = `X, i = 0 //It does not return any record (why?)
select from prices where date = 2019.03.07, instrument = `X, i = first i //Seem to work. Does it?
In Oracle an equivalent will be:
select * from prices where date = to_date(...) and instrument = "X" and rownum = 1
and Oracle will stop immediately when it finds the first record.
How do I do this in kdb (i.e. stop immediately after it finds the first record)?
In kdb, where subclauses in select statements are executed sequentially, i.e. only those records which pass the first "test" get passed to the second test. With that in mind, looking at your two attempts:
select from prices where date = 2019.03.07, instrument = `X, i = 0 //It does not return any record (why?)
This doesn't (necessarily) return anything, because by the time it gets to the i=0 check, you've already filtered out some records (possibly including the first record in the original table, which would have i=0)
select from prices where date = 2019.03.07, instrument = `X, i = first i //Seem to work. Does it?
This one should work. First you filter by date. Then within the records for that date, you select the records for instrument `X. Then within those records, you take the record where i is the first i (i has already been filtered down, so first i is simply the index of the first surviving record [still the index from the original table, not the filtered-down version]).
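A tiny illustration of that sequential filtering, on a hypothetical three-row table:

q)t:([] instrument:`Y`X`X; price:1.1 2.2 3.3)
q)select from t where instrument=`X, i=0        / empty: the surviving indices are 1 2
q)select from t where instrument=`X, i=first i  / returns the row at original index 1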
The q-SQL equivalent for that is select[n], which also performs better than other approaches in most cases. A positive 'n' gives the first n records; a negative 'n' gives the last n records.
q) select[1] from prices where date = 2019.03.07, instrument = `X
There is no built-in functionality to stop after the first match. You could write a custom function for that, but it would probably execute more slowly than the supported version above.
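If you want to verify on your own data, \t:n runs an expression n times and reports the elapsed milliseconds (assuming the prices table from the question):

q)\t:100 1 sublist select from prices where date = 2019.03.07, instrument = `X
q)\t:100 select[1] from prices where date = 2019.03.07, instrument = `X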

OrientDB: Find Connected Components Values during the visit

I have a schema with 3 main classes: Transaction, Address, and ValueTx (an edge).
I am trying to find connected components within a range of time.
Right now I am running this query, based on this one (OrientDB: connected components OSQL query):
SELECT distinct(traversedElement(0)) from ( TRAVERSE both('ValueTx') from (select * from Transaction where height >= 402041 and height <= 402044))
This returns the rid of the 'head' of each traversal, and from it, by doing another DFS, I can get every node and edge of the connected component I want to inspect.
How can I, using the query above, also get the number of transactions within the connected component and the sum of their values? (The value of a tx is a property of the Transaction class.)
I want to do something like:
SELECT distinct(traversedElement(0)) as head, count(Transaction), sum(valueTot) from ( TRAVERSE both('ValueTx') from (select * from Transaction where height >= 402041 and height <= 402044)) group by head
But of course it is not working: I get only one row, with the last head and the sum over all the transactions.
Thanks in advance.
Edit:
This is an example of what I'm looking for:
Connected Transactions
Every transaction shown is within the same height range:
Using my query ( the first one in my post) I get the rid of the first node of each group of transaction that are linked through several addresses.
example:
#15:27
#15:28
#15:30
#15:34
#15:35
#15:36
#15:37
#15:41
#15:47
#15:53
What I'm trying to get is a list of every first node, together with the total number of transactions (transactions only, not addresses) in the group it belongs to, and the sum of the values of every Transaction (stored in valueTot inside the Transaction class).
Edit2:
This is the dataset where I am making the tests:
The main problem is that I have a lot of data, and the approach I was trying before (issuing a separate SQL query for every rid) is quite slow; I hope there is a faster way.
Edit3:
This is an updated sample db: Download
(note, it's way larger than the other)
select head, sum(valueTot) as valueTot, count(*) as numTx, sum(miner) as minerCount
from (
    SELECT *, traversedElement(0) as head
    from (
        TRAVERSE both('ValueTx') from (
            select * from Transaction where height >= 0 and height <= 110000
        ) while (@class = 'Address' or (@class = 'Transaction' and height >= 0 and height <= 110000))
    )
    where @class = 'Transaction'
)
group by head
This query takes around one minute on my system, even if I limit the result set, so I think the problem may be in the inner query that selects the transactions, which isn't using the indexes... Do you have any idea?
You can use this query
select @rid, $a[0].sum as sumValueTot, $a[0].count as countTransaction from Transaction
let $a = ( select sum(valueTot), count(*) from (TRAVERSE both('ValueTx') from $parent.$current) where @class = 'Transaction' )
where height >= 402041 and height <= 402044
Hope it helps.
Is this what you are looking for?
select head, sum(valueTot), count(*)
from (
    SELECT *, traversedElement(0) as head
    from (
        TRAVERSE both('ValueTx') from (
            select * from Transaction where height >= 402041 and height <= 402044
        )
    )
    where @class = 'Transaction'
)
group by head

Conditional OR in the SQL Server Join – Multi-Value Parameters

I have an SSRS report with 4 parameters, two of which are multi-value parameters (@material and @color, VARCHAR(MAX) datatype in SQL Server 2008 R2). I am using a split function to turn each comma-separated value into rows:
SELECT *
FROM MyView
WHERE height > 200
AND width > 100
AND (
material IN (SELECT Item FROM [dbo].[MySplitFunction] (@material, ',')) OR
color IN (SELECT Item FROM [dbo].[MySplitFunction] (@color, ','))
)
(The code above would return 50 records)
The problem with this approach is that these two multi-value parameters hold around 1,500 different colors and materials, which degrades performance. Sometimes it takes more than 40 minutes to return the results (the view has around 600,000 rows).
I tried a different approach where I used a temp table and used it in the JOIN instead of the WHERE clause:
SELECT Item
INTO #TempTable
FROM [dbo].[MySplitFunction] (@material, ',')
SELECT *
FROM MyView
INNER JOIN #TempTable ON MyView.material = #TempTable.Item
WHERE height > 200
AND width > 100
AND material IN (SELECT Item FROM [dbo].[MySplitFunction] (@material, ','))
(The code above would return 7 records only, but the performance is much better)
My question is: how can I return the same number of records (50 rows) using the second approach, adding the other @color parameter and allowing the OR condition? So in the SSRS report the user can multi-select these two parameters, and the query will return @material = values OR @color = values.
I am open to any other approach as long as it speeds up the query and allows the OR condition for the two multi-value parameters (@material, @color).
Thanks!
Something like the following might do the trick. I'm not sure I have the syntax precisely right, and it wants further testing and analysis that I can't do without the proper structures and data...
SELECT *
from MyView
where height > 200
and width > 100
and (exists (select Item
             from dbo.MySplitFunction(@material, ',')
             where Item = material)
  or exists (select Item
             from dbo.MySplitFunction(@color, ',')
             where Item = color)
    )
This performs two correlated subqueries on nested function calls. EXISTS checks are generally faster than IN lookups in these situations. The syntax bit that worries me is the "and (exists" bit -- that's the parenthesis for the OR clause, and combined with exists it looks a bit wonky.
I think it should do what you want, but testing is definitely called for.
I mistrust that or clause. To get rid of it, try this and see what happens:
SELECT * -- Better with specific columns
from MyView
where height > 200
and width > 100
and exists (select Item
from dbo.MySplitFunction(@material, ',')
where Item = material)
UNION select *
from MyView
where height > 200
and width > 100
and exists (select Item
from dbo.MySplitFunction(@color, ',')
where Item = color)
This runs and combines two queries, removing all duplicates -- pretty much the same as the OR clause would.
The next thing to check would be reviewing table sizes and indexes. You're filtering results on (only!) the columns height, width, material, and color; if the table is huge, an appropriate index would help here.
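For instance, if MyView reads from a single base table, a covering index along these lines could help (the base table name here is a placeholder assumption; substitute the real table behind MyView):

-- hypothetical base table name; adjust to your schema
CREATE NONCLUSTERED INDEX IX_MyBaseTable_height_width
ON dbo.MyBaseTable (height, width)
INCLUDE (material, color);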

T-SQL Query to process data in batches without breaking groups

I am using SQL 2008 and trying to process the data I have in a table in batches, however, there is a catch. The data is broken into groups and, as I do my processing, I have to make sure that a group will always be contained within a batch or, in other words, that the group will never be split across different batches. It's assumed that the batch size will always be much larger than the group size. Here is the setup to illustrate what I mean (the code is using Jeff Moden's data generation logic: http://www.sqlservercentral.com/articles/Data+Generation/87901)
DECLARE #NumberOfRows INT = 1000,
#StartValue INT = 1,
#EndValue INT = 500,
#Range INT
SET #Range = #EndValue - #StartValue + 1
IF OBJECT_ID('tempdb..#SomeTestTable','U') IS NOT NULL
DROP TABLE #SomeTestTable;
SELECT TOP (#NumberOfRows)
GroupID = ABS(CHECKSUM(NEWID())) % #Range + #StartValue
INTO #SomeTestTable
FROM sys.all_columns ac1
CROSS JOIN sys.all_columns ac2
This will create a table with about 435 groups of records containing between 1 and 7 records each. Now, let's say I want to process these records in batches of 100 records per batch. How can I make sure that my GroupIDs don't get split between different batches? I am fine if each batch is not exactly 100 records; it could be a little more or a little less.
I appreciate any suggestions!
This will result in batches slightly smaller than 100 entries; it removes any group that isn't entirely within the selection:
WITH cte AS (SELECT TOP 100 * FROM (
SELECT GroupID, ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY GroupID) r
FROM #SomeTestTable) a
ORDER BY GroupID, r DESC)
SELECT c1.GroupID FROM cte c1
JOIN cte c2
ON c1.GroupID = c2.GroupID
AND c2.r = 1
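For comparison, here is an alternative sketch that assigns every row an explicit batch number of roughly 100 rows without ever splitting a group (plain SQL 2008 constructs; reuses #SomeTestTable from the setup above):

WITH grp AS (      -- rows per group
    SELECT GroupID, COUNT(*) AS GroupCnt
    FROM #SomeTestTable
    GROUP BY GroupID
), run AS (        -- running row total in GroupID order
    SELECT g1.GroupID,
           (SELECT SUM(g2.GroupCnt) FROM grp g2 WHERE g2.GroupID <= g1.GroupID) AS RunningCnt
    FROM grp g1
)
SELECT t.GroupID,
       (r.RunningCnt - 1) / 100 AS BatchNo  -- the whole group shares one BatchNo
FROM #SomeTestTable t
JOIN run r ON r.GroupID = t.GroupID
ORDER BY BatchNo, t.GroupID;

Batch sizes drift a little above or below 100, which matches the stated tolerance.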
It'll select the groups with the lowest GroupID's, limited to 100 entries into a common table expression along with the row number, then it'll use the row number to throw away any groups that aren't entirely in the selection (row number 1 needs to be in the selection for the group to be, since the row number is ordered descending before cutting with TOP).