Imagine two kdb+ tables: t1 records tick data (security prices from different sources, i.e. multiple columns) with a timestamp, and t2 records trades with a timestamp.
My goal:
Append a column to t2 that, for each timestamp in t2, takes the value of one column in t1 at the closest (or matching) t1 timestamp. In other words, I want to map the value of a certain column in t1 onto t2 based on the timestamp.
I appreciate this is a bit convoluted, but I was hoping there might be a way other than running a query for each entry in t2.
Thanks!
This may not be exactly what you are looking for, but it might be helpful to consider an as-of join:
aj[`sym`time;t2;t1]
Assuming the records in both tables are ordered by the time column, this expression returns, for each row of t2, the row of t1 that is in effect "as of" the time in t2.
Specifically, for a given time value in t2, the match picks the greatest time in t1 that is less than or equal to that value.
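For illustration, here is a minimal sketch with made-up tables (the sym, srcA, srcB, and size columns and all values are assumptions; only the `sym`time join columns come from the call above):
t1:([] sym:`IBM`IBM`MSFT; time:09:30:00.000 09:31:00.000 09:30:30.000; srcA:100.1 100.4 250.2; srcB:100.2 100.5 250.1)
t2:([] sym:`IBM`MSFT; time:09:30:30.000 09:31:00.000; size:100 200)
aj[`sym`time;t2;t1]                           / appends srcA and srcB as of each trade's time
aj[`sym`time;t2;select sym,time,srcA from t1] / or restrict t1 first to map just the one column you need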
For further reading, please refer to https://code.kx.com/q/ref/joins/#aj-aj0-ajf-ajf0-asof-join
At work we have a SQL Server 2019 instance. Two big tables in the same database have to be joined to obtain specific data. One contains GPS data taken at 4-minute intervals, though there can be additional records in between; the important attributes are a non-key file_id, a timestamp (DATE_TIME column), latitude, and longitude. The other attributes are not relevant, and the key is an autogenerated identity column, so it's of no use to me.
The other table contains transaction records that have, among other attributes, a timestamp (FECHATRX column) and the same non-key file ID the GPS table has, plus an autogenerated key with no relation at all to the other key.
For each file ID there are several records in both tables that have to be joined somehow, so that for a given file ID and transaction record I obtain both its latitude and longitude. The tables aren't ordered at all.
My idea is to pair records of the same file ID, and I imagine it working like this (I haven't tried it yet, as the problem was only explained to me earlier today):
Order both tables by file ID and timestamp.
For a given file ID, every transaction record whose timestamp is greater than or equal to the first GPS timestamp and lower than the following GPS timestamp gets the latitude and longitude of that first GPS record, as those transactions are considered to belong to that latitude-longitude pair (in reality they are probably somewhere in between, but this is an assumption everybody agrees with).
Once a transaction record's timestamp is greater than or equal to the second GPS timestamp, the third timestamp becomes the new end point: the transaction records in between take the coordinates of the second GPS record, and so on, until a new file ID is reached or one or both tables run out of records.
To me this sounds like nested cursors and several variables: some to hold the first GPS record's values, one for the second GPS record's timestamp for comparison, and of course the file ID itself as a control variable. But is this the best way to obtain the latitude/longitude data for each and every transaction record from the GPS table?
Are other approaches better than using nested cursors?
As I said, I haven't written anything yet; the only thing I can do is show you some data from both tables. I just wanted to know whether there is another (and simpler) way of doing this than nested cursors.
Thank you.
Alejandro
No need to reorder tables or use a complex cursor loop. A properly constructed index can support an efficient join, and a CROSS APPLY or OUTER APPLY can handle the "select closest prior GPS coordinate" lookup logic.
Assuming your table structure is something like:
GPS(gps_id, file_id, timestamp, latitude, longitude, ...)
Transaction(transaction_id, timestamp, file_id, ...)
First create an index on the GPS table to allow efficient lookup by file_id and timestamp.
CREATE INDEX IX_GPS_FileId_Timestamp
ON GPS(file_id, timestamp)
INCLUDE(latitude, longitude)
The INCLUDE clause is optional, but it allows the index to serve the latitude/longitude values without having to access the base table.
You can then use a query something like:
SELECT *
FROM [Transaction] T
OUTER APPLY (
    -- latest GPS fix at or before the transaction time
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    -- earliest GPS fix at or after the transaction time
    SELECT TOP 1 *
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2
CROSS APPLY and OUTER APPLY behave like INNER JOIN and LEFT JOIN, but give you the flexibility of a correlated subquery with complex conditions, which handles cases like this.
The G1 subquery efficiently selects the GPS record with the same file_id and the latest timestamp less than or equal to the transaction's; G2 does the same for the earliest timestamp greater than or equal to it. Per your requirements you only need G1, but having both gives you the opportunity to interpolate between the two points, or to handle cases where there is no preceding matching record.
See this fiddle for a demo.
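If you do want to interpolate between the two fixes rather than just take the prior one, here is a rough sketch (latitude only; longitude is analogous, and the linear weighting is an assumption on my part, not part of your stated requirements):
SELECT T.*,
       CASE
           WHEN G1.timestamp IS NULL THEN G2.latitude        -- no prior fix for this file_id
           WHEN G2.timestamp IS NULL THEN G1.latitude        -- no following fix
           WHEN G1.timestamp = G2.timestamp THEN G1.latitude -- exact timestamp match
           ELSE G1.latitude + (G2.latitude - G1.latitude)
                * DATEDIFF(SECOND, G1.timestamp, T.timestamp)
                / CAST(DATEDIFF(SECOND, G1.timestamp, G2.timestamp) AS FLOAT)
       END AS latitude_est
FROM [Transaction] T
OUTER APPLY (
    SELECT TOP 1 G.timestamp, G.latitude
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp <= T.timestamp
    ORDER BY G.timestamp DESC
) G1
OUTER APPLY (
    SELECT TOP 1 G.timestamp, G.latitude
    FROM GPS G
    WHERE G.file_id = T.file_id
      AND G.timestamp >= T.timestamp
    ORDER BY G.timestamp
) G2;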
I have a relatively small table (t1) and want to join a large time series (t2) to it with an as-of join. The time series is too large to do this in one go, so I want to split the operation into daily chunks.
Given a list of dates, I want to execute the same query for each date:
aj[`Id`Timestamp;select from t1 where date=some_date;select from t2 where date=some_date]
Ideally this should return a list of tables l so that I can simply join them:
l[0] uj/ 1_l
I believe something like this should work:
raze {aj[`Id`Timestamp; select from t1 where date=x; select from t2 where date=x]} each exec distinct date from t1
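For completeness, the list-of-tables form from the question would be (a sketch; note it takes the distinct dates from t2, the large table being chunked, whereas the one-liner above takes them from t1 — either works if both tables cover the same dates):
l:{aj[`Id`Timestamp; select from t1 where date=x; select from t2 where date=x]} each exec distinct date from t2
l[0] uj/ 1_l   / equivalent to raze l here, since every chunk has the same schema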
I have a very big table in DB2 with around 500 million rows.
I need to select only the most recent day's rows, based on a timestamp column and other conditions.
I did something like this, but it takes forever (about 10 minutes) to return results. Is there a way to query this faster? I am not familiar with DB2.
DTM is a TIMESTAMP column:
select a, b, c, d, e, DTM from table1
where e = 'I' and DTM > current timestamp - 1 days
Any help would be appreciated.
Besides an index, another option may be range partitioning of this table. If you range partition by month, you only have to scan one month's partition, for example. Even better, partition by day, and put the partitioning key in the index so the index is partitioned too.
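As a rough Db2 (LUW) sketch of what that could look like — the column types, partition bounds, and names below are assumptions, with only e and DTM taken from the question:
-- monthly range partitions on the timestamp column
CREATE TABLE table1 (
    a   VARCHAR(20),
    b   VARCHAR(20),
    c   VARCHAR(20),
    d   VARCHAR(20),
    e   CHAR(1),
    DTM TIMESTAMP NOT NULL
)
PARTITION BY RANGE (DTM)
(STARTING FROM ('2024-01-01') ENDING ('2024-12-31') EVERY 1 MONTH);
-- partitioned index that leads with the filter column and includes the partitioning key
CREATE INDEX ix_table1_e_dtm ON table1 (e, DTM) PARTITIONED;
With a layout like this, the DTM predicate in your query only has to touch the current partition instead of scanning all 500 million rows.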
I have a table of time series data where, for almost all queries, I wish to select data ordered by collection time. I do have a timestamp column, but I do not want to rely on timestamps alone, because if two entries have the same timestamp it is crucial that I can still sort them in the order they were collected, which is information I have at insert time.
My current schema just has a timestamp column. How would I alter my schema to make sure I can sort based on collection/insertion time, and make sure querying in collection/insertion order is efficient?
Add a column backed by a sequence (i.e. serial), and create an index on (timestamp_column, serial_column). Then you can get insertion order (more or less) with:
ORDER BY timestamp_column, serial_column;
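A minimal Postgres sketch, assuming a hypothetical readings table with a collected_at timestamp column:
-- hypothetical time-series table
CREATE TABLE readings (
    collected_at TIMESTAMPTZ NOT NULL,
    value        DOUBLE PRECISION
);
-- sequence-backed column records insertion order even when timestamps collide
ALTER TABLE readings ADD COLUMN insert_seq BIGSERIAL;
-- composite index lets the planner serve the ORDER BY with an index scan
CREATE INDEX ix_readings_time_seq ON readings (collected_at, insert_seq);
SELECT * FROM readings ORDER BY collected_at, insert_seq;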
You could use a SERIAL column called insert_order; that way no two rows will have the same value. However, I am not sure your requirement of absolute time order is achievable.
For example, suppose two transactions, T1 and T2, happen at the same time on a machine with multiple processors, so both inserts occur at exactly the same instant. Is this a case you are concerned about? There was not enough information in your question to know.
Also, with a serial column you have the issue of gaps: for example, T1 could grab serial value 14 and T2 value 15, then T1 rolls back and T2 does not, so you have to expect that the insert_order column may have gaps in it.
A question from a beginner.
I have two tables. One (A) contains Start_time, End_time, and Status. The second (B) contains Timestamp and Error_code. Table B is logged automatically by the system every few seconds, so it contains many non-unique Error_code values (the code changes randomly, but within a time range from table A). What I need is to select the unique error code for every time range (in my case, every row) of table A:
A.Start_time, A.End_time, B.Error_code.
I have come to this:
select A.Start_time,
       A.End_time,
       B.Error_code
from B
inner join A
    on B.Timestamp between A.Start_time and A.End_time
This is wrong, I know.
Any thoughts are welcome.
If your query gives a lot of duplicates, use DISTINCT to remove them:
select DISTINCT A.Start_time, A.End_time, B.Error_code
from B
inner join A on B.Timestamp between A.Start_time and A.End_time
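If a single time range can contain more than one distinct error code and you need exactly one row per range, aggregate instead; MIN here is an arbitrary tie-breaker, not something implied by your schema:
select A.Start_time,
       A.End_time,
       min(B.Error_code) as Error_code
from B
inner join A on B.Timestamp between A.Start_time and A.End_time
group by A.Start_time, A.End_time;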