KDB - query day by day and join results together

I have a relatively small table (t1) and want to join a large time series (t2) with it through an as-of join. The time series is too large to process in one go, so I want to split the operation into daily chunks.
Given a list of dates, I want to execute the same query for each date:
aj[`Id`Timestamp;select from t1 where date=some_date;select from t2 where date=some_date]
Ideally this should return a list of tables l so that I can simply join them:
l[0] uj/ 1_l

I believe something like this should work:
raze {aj[`Id`Timestamp; select from t1 where date=x; select from t2 where date=x]} each exec distinct date from t1
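As a minimal sketch with toy tables (the schemas, the sample values, and the price column are assumptions for illustration, not from the original question):
/ small lookup table t1 and a larger quote table t2, both carrying a date column
t1:([] date:2024.01.01 2024.01.01 2024.01.02; Id:`a`b`a; Timestamp:2024.01.01D10:00:00 2024.01.01D11:00:00 2024.01.02D10:00:00)
t2:([] date:2024.01.01 2024.01.01 2024.01.02; Id:`a`b`a; Timestamp:2024.01.01D09:30:00 2024.01.01D09:30:00 2024.01.02D09:30:00; price:100 200 105f)
/ as-of join one day at a time, then collapse the per-day results into one table
raze {aj[`Id`Timestamp; select from t1 where date=x; select from t2 where date=x]} each exec distinct date from t1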

How to update counts by date in table A with the counts by date returned from join of table B and table C

I can do this using a temporary table. Is it possible to do these two steps in a single update query?
All possible dates already exist in the TargetTable (no inserts are necessary).
I'm hoping to make this more efficient since it is run often as batches of data periodically pour into table T2.
Table T1: list of individual dates inserted or updated in this batch
Table T2: a datetime2(3) field followed by several data fields; there may be thousands of rows for any particular date
Goal: update TargetTable (a date field followed by an int field) so it holds the total record count per date (rows may be newly arrived in T2 or appended to records already there)
select T1.date as TargetDate, count(*) as CountF1
into #Temp
from T1 inner join T2
on T1.date = cast(T2.DateTime as date)
group by T1.date
update TargetTable
set TargetField1 = CountF1
from #Temp inner join TargetTable
on TargetDate = TargetTable.Date
I agree with the recommendation of Zohar Peled: use a Common Table Expression (CTE), which can replace the temporary table in your scenario. You write a CTE using the WITH keyword; remember that in many cases you will need a semicolon before the WITH keyword (or at the end of the previous statement, if you prefer). The solution then looks like this:
;WITH CTE AS
(
    SELECT T1.date AS TargetDate, Count(*) AS CountF1
    FROM T1
    INNER JOIN T2 ON T1.date = Cast(T2.DateTime AS DATE)
    GROUP BY T1.date
)
UPDATE TargetTable
SET TargetField1 = CTE.CountF1
FROM CTE
INNER JOIN TargetTable ON CTE.TargetDate = TargetTable.Date;
Here is more information on Common Table Expressions:
https://learn.microsoft.com/en-us/sql/t-sql/queries/with-common-table-expression-transact-sql
After having done this, another thing you might benefit from is adding a new column of type DATE to table T2, holding the value of Cast(T2.DateTime AS DATE); it could even be a persisted computed column. Then add an index on that new column. If you join on the new column (instead of on the Cast(...) expression), the query may run faster depending on the distribution of the data. The only way to tell is to try it out.
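As a rough sketch of that change (the names DateOnly and IX_T2_DateOnly are made up for illustration):
ALTER TABLE T2 ADD DateOnly AS Cast([DateTime] AS DATE) PERSISTED;
CREATE INDEX IX_T2_DateOnly ON T2 (DateOnly);
The CTE above could then join on T1.date = T2.DateOnly instead of the Cast(...) expression.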

Is there a way to optimize this T-SQL query to use less spool space?

I'm running out of spool space and wondering if the query can be optimized.
I've tried DISTINCT and UNION ALL; GROUP BY doesn't make sense here.
SELECT DISTINCT T1.EMAIL, T2.BILLG_STATE_CD, T2.BILLG_ZIP_CD
FROM
    (SELECT EMAIL FROM CAT
     UNION ALL
     SELECT EMAIL FROM DOG
     UNION ALL
     SELECT email AS EMAIL FROM MOUSE) AS T1
LEFT JOIN HAMSTER AS T2 ON T1.EMAIL = T2.EMAIL_ADDR;
I will need to do this same type of data pull often, so I'm looking for a viable solution other than three separate joins. I need to union multiple tables (T1) and join columns from another table (T2) onto the result, with a date filter:
WHERE T2.ord_creatd_dt > DATE '2019-01-01' AND T2.ord_creatd_dt < DATE '2019-11-08'
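One variant that might reduce spool usage, assuming the duplicates come from the unioned email lists: deduplicate with UNION inside the derived table (instead of UNION ALL plus an outer DISTINCT), and keep the date filter in the ON clause so the LEFT JOIN semantics are preserved. A sketch, not a tested solution:
SELECT T1.EMAIL, T2.BILLG_STATE_CD, T2.BILLG_ZIP_CD
FROM
    (SELECT EMAIL FROM CAT
     UNION
     SELECT EMAIL FROM DOG
     UNION
     SELECT email AS EMAIL FROM MOUSE) AS T1
LEFT JOIN HAMSTER AS T2
    ON T1.EMAIL = T2.EMAIL_ADDR
    AND T2.ord_creatd_dt > DATE '2019-01-01'
    AND T2.ord_creatd_dt < DATE '2019-11-08';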

KDB - Mapping a column to a table using timestamps

Imagine two kdb tables: t1 records tick data (security prices from different sources, i.e. multiple columns) with a timestamp, and t2 records trades with a timestamp.
My goal:
Append a column to t2 such that it will, for each timestamp in t2, extract the value from one column in t1 where the timestamp is closest to (or matches) the timestamp in t2. So I almost want to map the value of a certain column in t1 to t2 based on the timestamp.
I appreciate this is a bit convoluted but was thinking there might be a way other than running a query for each entry in t2.
Thanks!
This may not be exactly what you are looking for, but it might be helpful to consider an as of join:
aj[`sym`time;t2;t1]
Assuming the records in both tables are sorted by the time column, this returns, for each row in t2, the row in t1 that was in effect "as of" that time.
Specifically, for a given time value in t2, the match picks the greatest time in t1 less than or equal to the given value in t2.
For further reading, please refer to https://code.kx.com/q/ref/joins/#aj-aj0-ajf-ajf0-asof-join
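As a minimal illustration (the table contents, the px and size columns, and the minute-typed times are all made up for the example):
t1:([] sym:`A`A`B; time:09:00 09:05 09:01; px:100 101 50f)
t2:([] sym:`A`B; time:09:03 09:02; size:10 20)
aj[`sym`time;t2;t1]
/ each trade row in t2 gains the px from the latest t1 row at or before its time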

DB2 Joining a Large Physical table with a small Global Temp table

I have the below requirement of joining 3 tables:
a) Table T1 - large physical table with 100 million rows
Index columns: C1, C2, C3, in this order
b) Table T2 - temp table with 50 records
Contains C2 and additional columns. No index.
c) Table T3 - temp table with 100 records
Contains C3 and additional columns. No index.
Tables T2 and T3 have no common columns.
I tried to extract data from T1, T2, and T3 as below:
Select T1.*, T2.*, T3.*
from T1
Inner join T2 on T1.C2 = T2.C2
Inner join T3 on T1.C3 = T3.C3
where T1.C1 = <a constant value coming from my program>
EXPLAIN on the above query shows that an index scan was performed on T1 matching on only one column (I believe it is T1.C3, since I provided the WHERE clause).
The query executes fine but takes somewhat longer than expected. Is there a better way to write the query for this requirement?
Any input is greatly appreciated.
You mention you're using a temp table. Did you run RUNSTATS on the temporary tables, including collecting column statistics?
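For example, on Db2 for LUW something like the following could collect table and column statistics on session temp tables (a sketch; declared temp tables on Db2 for z/OS are handled differently and this may not apply there):
RUNSTATS ON TABLE SESSION.T2 ON ALL COLUMNS;
RUNSTATS ON TABLE SESSION.T3 ON ALL COLUMNS;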
An index scan matching on one column has to be against T1 matching on column 1, since that is the leading column of the index. When examining explain, you should also pay attention to PRIMARY_ACCESSTYPE. Db2 may choose to scan one or all of T1, T2, and T3 and create a sparse index, which would be reflected with PRIMARY_ACCESSTYPE = T in the PLAN_TABLE.
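As a rough sketch of pulling those columns back from the plan table (column names as documented for the Db2 for z/OS PLAN_TABLE; filter on your QUERYNO as needed):
SELECT QUERYNO, QBLOCKNO, PLANNO, TNAME, ACCESSTYPE, MATCHCOLS, PRIMARY_ACCESSTYPE
FROM PLAN_TABLE
ORDER BY QUERYNO, QBLOCKNO, PLANNO;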
What is the cardinality of the 3 column index on the 100 million row table? Is it unique? Is it highly selective - close to table size of 100 million rows, or would significant duplicate rows qualify for every probe?
Accurate statistics are important in this scenario. The cost of a Cartesian join is quite high, so it's important that Db2 understands how small the temporary tables are, and how selective the join columns are, when choosing the access path. If no statistics are collected on tables T2 and T3, Db2 by default assumes 10,000 rows per table; a Cartesian join of T2 and T3 would then be estimated at 10,000 * 10,000 = 100 million rows, and it would make sense for Db2 to access T1 just once using the local filter and then join to T2 and T3, possibly with a sparse index.
If collecting statistics does not resolve the issue, please update the question with the plan table results.

Divide count of Table 1 by count of Table 2 on the same time interval in Tableau

I have two tables with IDs and time stamps. Table 1 has two columns: ID and created_at. Table 2 has two columns: ID and post_date. I'd like to create a chart in Tableau that displays the Number of Records in Table 1 divided by Number of Records in Table 2, by week. How can I achieve this?
One way might be to use Custom SQL like this to create a new data source for your visualization:
SELECT created_table.created_date,
       created_table.created_count,
       posted_table.posted_count
FROM (SELECT TRUNC(created_at) AS created_date, COUNT(*) AS created_count
      FROM Table1
      GROUP BY TRUNC(created_at)) created_table
LEFT JOIN
     (SELECT TRUNC(post_date) AS posted_date, COUNT(*) AS posted_count
      FROM Table2
      GROUP BY TRUNC(post_date)) posted_table
ON created_table.created_date = posted_table.posted_date
This would give you dates and counts from both tables for those dates, which you could group using Tableau's date functions in the visualization. I made created_table the first part of the left join on the assumption that some records would be created and not posted, but you wouldn't have posts without creations. If that isn't the case you will want a different join.
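From there, a hypothetical Tableau calculated field (using the field names from the Custom SQL above) could compute the ratio, with the weekly grouping coming from created_date:
// hypothetical calculated field, e.g. "Created per Posted"
SUM([created_count]) / SUM([posted_count])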