How to apply asof join on partitioned tables in kdb

I need to run an asof join on a few years of data from trade and quote tables, which are partitioned.
The documentation at https://code.kx.com/v2/ref/aj/ states:
"If further where constraints are used, the columns will be copied instead of mapped into memory, slowing down the join."
How can I use an asof join over a partitioned database with date and other constraints without hurting performance or memory usage?
Eg: aj[`sym`time;select from trade where date>2019.01.01, app=`abc; select from quote where date>2019.01.01]

You should use date= and run the join once per date with each/peach rather than using date>, i.e.
raze{aj[`sym`time;select from trade where date=x, app=`abc;select from quote where date=x]}peach 2019.01.01 2019.01.02 2019.01.03
An additional filter on trade is usually acceptable, but probably only if that column (app) has the parted attribute. You can't put any filters on quote, or performance tanks.
Note that with this approach you can't join prevailing data from the previous day onto the next day, but in most cases you wouldn't want or need to do that anyway.
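To make the last point concrete, here is a minimal sketch in Python (not q) of an as-of join that is run once per date and then concatenated. The helper `asof_join_one_date`, the row layout, and the sample data are all made up for illustration; it only mimics the semantics, not kdb+'s memory mapping.

```python
from bisect import bisect_right

def asof_join_one_date(trades, quotes):
    """As-of join within a single date partition.

    trades: list of (sym, time, price) tuples.
    quotes: list of (sym, time, bid) tuples, sorted by time within each sym.
    Returns (sym, time, price, bid) rows; bid is None when no quote exists
    at or before the trade time on that date.
    """
    # Group quote times and bids by sym so each sym can be binary-searched.
    times, bids = {}, {}
    for sym, t, bid in quotes:
        times.setdefault(sym, []).append(t)
        bids.setdefault(sym, []).append(bid)

    out = []
    for sym, t, price in trades:
        # Index of the last quote with time <= trade time, or -1 if none.
        i = bisect_right(times.get(sym, []), t) - 1
        out.append((sym, t, price, bids[sym][i] if i >= 0 else None))
    return out

# Hypothetical data: one entry per date partition.
trades_by_date = {
    "2019.01.02": [("A", 100, 10.0)],
    "2019.01.03": [("A", 50, 10.5), ("A", 200, 10.7)],
}
quotes_by_date = {
    "2019.01.02": [("A", 90, 9.9)],
    "2019.01.03": [("A", 150, 10.6)],
}

# One join per date, results concatenated (the role raze plays in the q code).
joined = []
for d in sorted(trades_by_date):
    joined.extend(asof_join_one_date(trades_by_date[d], quotes_by_date[d]))
```

In this sample the trade at time 50 on 2019.01.03 gets no quote, even though the previous day's 9.9 bid would still have been prevailing; that is the trade-off of joining date by date.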

Related

Scalable approach to calculating balances over thousands of records in PostgreSQL

I'm facing the challenge of generating 'balance' values from thousands of entries in a PG table.
The rows in the table have many different columns, each useful in calculating that rows contribution to the balance. Each row/entry belongs to some profile. I need to calculate the balance value for some profile, from all entries belonging to that profile according to some set of rules. Complexity should be O(N) - N being the number of entries that belong to the profile.
The different approaches I took:
Fetching the rows and calculating balances on the backend. This doesn't scale well and degrades quickly as the number of entries belonging to the profile grows. While fetching the entries is initially fast, once a profile has over 10,000 entries it becomes prohibitively slow.
I figured that a lot of time is being spent on transport; additionally, we don't really need the rows, only the balances. Since we already do the work of finding the entries, we can calculate the balance in the database and save time on backend calculations as well as on the transport of thousands of rows, which led to the second approach:
The second approach was creating a PG query that iterates over the rows and calculates the balance. This has proven to be more scalable when there are many entries per profile. However, probably due to the complexity of the PG query, this approach puts a lot of load on the database. It's enough to run 3-4 of these queries concurrently to max out the database CPU.
The third approach is to create a PL/pgSQL function to loop over the relevant entries and return the rows, hoping to reduce the impact on the database. This is the next thing I want to try.
Main question is - what would be the most efficient way to achieve this while being 'database friendly'?
Additionally:
Whether you think these approaches are sane?
Am I missing another obvious solution?
Is it unlikely that a function looping over the same rows as the query will outperform the query, or is it worth trying?
I realize I haven't provided a lot of concrete data, but I figured that since this is probably a common problem, maybe the issue can be understood from a general description.
EDIT:
To be a bit more specific, I'm dealing with the following data:
CREATE TABLE entries (
profileid bigint NOT NULL,
programid bigint NOT NULL,
ledgerid text NOT NULL, -- this provides further granularity, on top of 'programid'
startdate timestamptz,
enddate timestamptz,
amount numeric NOT NULL
)
What I want to get is the balances for a certain profileid, separated by (programid, ledgerid).
The desired form is:
RETURNS TABLE (
programid bigint,
ledgerid text,
currentbalance numeric,
pendingbalance numeric,
expiredbalance numeric,
spentbalance numeric
)
The four balance values are produced by applying arithmetic on certain entries. For example, negative amount would only add to spentbalance, expired balance is generated from entries that have a positive amount and the enddate is after now(), etc...
While I did manage to create a very large aggregate query with many calls to COALESCE(SUM(CASE WHEN ... amount), 0), I was wondering if I have anything to benefit from porting that logic into a PL/pgSQL function. However, when trying to implement this function I realized I don't know how to iterate over one function and return another, different in columns and rows, function. Should I use a temp table for this? Seems like an overkill as this query is expected to execute tens of times every second...
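For reference, the single-pass conditional aggregation described above can be sketched as follows. This uses Python's sqlite3 purely so the example is runnable; the four balance rules, the sample rows, and the fixed `now` value are placeholders invented for illustration, not the rules from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE entries (
        profileid INTEGER NOT NULL,
        programid INTEGER NOT NULL,
        ledgerid  TEXT    NOT NULL,
        startdate TEXT,             -- ISO dates, so text comparison orders correctly
        enddate   TEXT,
        amount    INTEGER NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 10, "L1", "2020-01-01", "2030-01-01", 100),  # active credit
        (1, 10, "L1", "2020-01-01", "2021-01-01", 40),   # credit whose enddate passed
        (1, 10, "L1", "2020-01-01", "2030-01-01", -30),  # debit
        (1, 20, "L2", "2029-01-01", "2030-01-01", 70),   # credit not yet started
        (2, 10, "L1", "2020-01-01", "2030-01-01", 999),  # belongs to another profile
    ],
)

# Assumed rules (illustrative only): each row is read once and contributes
# to whichever bucket its CASE condition selects.
SQL = """
    SELECT programid, ledgerid,
        COALESCE(SUM(CASE WHEN amount > 0 AND startdate <= :now AND enddate > :now
                          THEN amount END), 0) AS currentbalance,
        COALESCE(SUM(CASE WHEN amount > 0 AND startdate > :now
                          THEN amount END), 0) AS pendingbalance,
        COALESCE(SUM(CASE WHEN amount > 0 AND enddate <= :now
                          THEN amount END), 0) AS expiredbalance,
        COALESCE(SUM(CASE WHEN amount < 0
                          THEN amount END), 0) AS spentbalance
    FROM entries
    WHERE profileid = :pid
    GROUP BY programid, ledgerid
    ORDER BY programid, ledgerid
"""
balances = conn.execute(SQL, {"now": "2025-01-01", "pid": 1}).fetchall()
```

Each entry is visited once, so the work stays linear in the number of entries for the profile, and only one row per (programid, ledgerid) leaves the database.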

Union with Redshift native tables and external tables (Spectrum)

If I have a view that contains a union between a native table and external table like so (pseudocode):
create view vwPageViews as
select * from PageViews
union all
select * from PageViewsHistory
PageViews holds data for the last 2 years. The external table (PageViewsHistory) holds data older than 2 years.
If a user selects from the view with filters for the last 6 months, how does RS Spectrum handle it - does it read the entire external table even though none will be returned (and accordingly cost us money for all of it)? (Assuming the s3 files are parquet based).
ex.
select * from vwPageViews where MyDate >= '01/01/2021'
What's the best approach for querying both cold and historical data using RS and Spectrum? Thanks!
How this will happen on Spectrum will depend on whether or not you have provided partitions for the data in S3. Without partitions (and a where clause on the partition) the Spectrum engines in S3 will have to read every file to determine if the needed data is in any of them. The cost of this will depend on the number and size of the files AND what format they are in. (CSV is more expensive than Parquet for example.)
The way around this is to partition the data in S3 and to have a WHERE clause on the partition value. This will exclude files from needing to be read when they don't match on the partition value.
The rub is in providing the WHERE clause for the partition, as this will likely be less granular than the date or timestamp you are using in your base data. For example, if you partition on YearMonth (YYYYMM) and want a day-level WHERE clause, you will need two parts to the WHERE clause: WHERE date_col >= '2015-07-12' AND part_col >= 201507. How to produce both WHERE conditions will depend on your solution around Redshift.
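One way to produce both conditions is to derive the partition value from the same date in whatever layer generates the query. A minimal Python sketch, assuming an integer YYYYMM partition column; the function name `spectrum_where` and the column names are hypothetical:

```python
from datetime import date

def spectrum_where(start: date, date_col: str = "date_col",
                   part_col: str = "part_col") -> str:
    """Build a WHERE clause with a day-level filter and a matching
    partition-level filter (partition assumed to be an integer YYYYMM)."""
    part_value = start.year * 100 + start.month  # e.g. 2015-07-12 -> 201507
    return (f"WHERE {date_col} >= '{start.isoformat()}' "
            f"AND {part_col} >= {part_value}")

clause = spectrum_where(date(2015, 7, 12))
```

The string formatting is only for readability here; a real query builder should pass the values as bound parameters.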

Is there a technique with timescaledb to delete rows to reduce the frequency of older timescale data?

I'm storing a number of rows in a hypertable. The table size is growing quite large now even in its current test configuration.
I'd like to reduce the frequency of data from say once every 5 seconds to say once every 60 seconds for data older than a week by deleting a number of these older records.
Can anyone recommend an approach for doing so, or perhaps a better approach that better fits with timescaledb design?
One of the next releases will have a built-in feature for data retention policies on continuous aggregations, so that you can define a continuous aggregation policy that rolls up secondly data into minutely data, then drops the secondly data that's older than some time period.
(That capability doesn't exist today with continuous aggs, but will very shortly. Right now the best approach is either to have some cron job that deletes old data, or one that copies from one table to a second while aggregating, then calling drop_chunks on the first table.)
Ok, I've read 2 minutes of timescaledb documentation, so I'm an expert, right. Here's what I propose:
You already have a table (I'll call it the business table) and a hypertable with raw 5-second data in it
Create a second hypertable with the same columns as the first hypertable
Insert into the 2nd hypertable using a 60-second windowing function and average, minimum, or maximum values for your readings data (you have to decide on which aggregation function is meaningful for your case.) This insert SQL looks something like:
INSERT INTO minute_table (timestamp, my_reading)
(SELECT time_bucket('60 seconds', time) AS the_minute, avg(my_raw_reading)
 FROM five_second_table
 WHERE time < (now() - interval '1 week')
 GROUP BY the_minute
);
Next, delete from the 5-second hypertable where the timestamp in there is within any range of times in the 60-second hypertable.
Finally, schedule something like this to run every week.
Sorry I'm not fluent in all the timescaledb functions but this should get you started on the 'heavy lift' of manually aggregating up from 5-second to 60-second samples.
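To show what the roll-up step computes, here is a small plain-Python sketch of the same idea: readings older than a cutoff are averaged into 60-second buckets, newer ones are kept as they are. It does not use TimescaleDB at all; the `downsample` function and the sample readings are invented for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def downsample(readings, cutoff, bucket=timedelta(seconds=60)):
    """Split readings into (bucketed_rows, kept_raw_rows).

    readings: iterable of (datetime, float).
    Rows older than cutoff are averaged per bucket; the rest stay raw.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket start -> [total, count]
    kept = []
    for ts, value in readings:
        if ts < cutoff:
            # Floor the timestamp to the start of its bucket.
            start = ts - (ts - EPOCH) % bucket
            sums[start][0] += value
            sums[start][1] += 1
        else:
            kept.append((ts, value))
    rows = sorted((start, total / n) for start, (total, n) in sums.items())
    return rows, kept

now = datetime(2024, 1, 15)
readings = [
    (datetime(2024, 1, 1, 0, 0, 0), 1.0),
    (datetime(2024, 1, 1, 0, 0, 5), 3.0),
    (datetime(2024, 1, 1, 0, 1, 0), 5.0),
    (datetime(2024, 1, 14, 0, 0, 0), 9.0),  # newer than a week, stays raw
]
minute_rows, raw_rows = downsample(readings, cutoff=now - timedelta(weeks=1))
```

Averaging is only one choice; swapping in min or max changes a single line, which mirrors the decision about the aggregation function mentioned above.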
Take a look at Data Retention.
For example:
SELECT drop_chunks(interval '24 hours', 'conditions');
This will drop all chunks from the hypertable 'conditions' that only include data older than this duration, and will not delete any individual rows of data in chunks.
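The chunk-level behaviour can be pictured with a simplified model in Python. This is not TimescaleDB code: the chunk names and time ranges are hypothetical, and the rule "drop a chunk only when its whole range is older than the cutoff" is a paraphrase of the description above.

```python
from datetime import datetime, timedelta

def droppable_chunks(chunks, now, older_than):
    """Return names of chunks whose entire time range is older than the cutoff.

    chunks: list of (name, range_start, range_end) with range_end exclusive.
    """
    cutoff = now - older_than
    return [name for name, _start, end in chunks if end <= cutoff]

# Hypothetical one-day chunks.
chunks = [
    ("chunk_1", datetime(2024, 1, 1), datetime(2024, 1, 2)),
    ("chunk_2", datetime(2024, 1, 2), datetime(2024, 1, 3)),
    ("chunk_3", datetime(2024, 1, 3), datetime(2024, 1, 4)),
]
dropped = droppable_chunks(chunks, now=datetime(2024, 1, 3, 12),
                           older_than=timedelta(hours=24))
```

Here chunk_2 survives even though half of its rows are older than 24 hours, because the chunk as a whole is not. Thinning rows inside a chunk therefore needs a separate DELETE or a roll-up table.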

Simple update query taking too long - Postgres

I have a table with 28 million rows that I want to update. It has around 60 columns and an ID column (primary key) with an index created on it. I created four new columns and I want to populate them with the data from four columns of another table, which also has an ID column with an index created on it. Both tables have the same number of rows, and each has only the primary key and the index on the IDENTI column. The query has been running for 15 hours, and since it is high-priority work, we are starting to get nervous about it and don't have much time to experiment with queries. We have never updated a table this big (7 GB), so we are not sure if this amount of time is normal.
This is the query:
UPDATE consolidated
SET IDEDUP2=uni.IDEDUP2,
USE21=uni.USE21,
USE22=uni.USE22,
PESOXX2=uni.PESOXX2
FROM uni_group uni, consolidated con
WHERE con.IDENTI=uni.IDENTI
How can I make it faster? Is it possible? If not, is there a way to check how much longer it is going to take (without killing the process)?
Just as additional information, we have previously run much more complex queries on 3-million-row tables (PostGIS), and those took about 15 hours as well.
You should not repeat the target table in the FROM clause. Your statement creates a cartesian join of the consolidated table with itself, which is not what you want.
You should use the following:
UPDATE consolidated con
SET IDEDUP2=uni.IDEDUP2,
USE21=uni.USE21,
USE22=uni.USE22,
PESOXX2=uni.PESOXX2
FROM uni_group uni
WHERE con.IDENTI = uni.IDENTI
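The size of the accidental cross join is easy to see on a toy example. The sketch below uses Python's sqlite3 and plain SELECT COUNT(*) queries to count the row combinations each form has to consider; it is an analogy for the join shape only, not a reproduction of the PostgreSQL plan, and the four-row tables are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE consolidated (identi INTEGER PRIMARY KEY);
    CREATE TABLE uni_group (identi INTEGER PRIMARY KEY, idedup2 INTEGER);
    INSERT INTO consolidated VALUES (1), (2), (3), (4);
    INSERT INTO uni_group VALUES (1, 10), (2, 20), (3, 30), (4, 40);
""")

# Original form: the update target is never constrained, so every target row
# pairs with every (uni, con) match.
cross_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM consolidated AS target, uni_group AS uni, consolidated AS con
    WHERE con.identi = uni.identi
""").fetchone()[0]

# Corrected form: the target itself is joined to uni_group on the key.
keyed_join_rows = conn.execute("""
    SELECT COUNT(*)
    FROM consolidated AS con, uni_group AS uni
    WHERE con.identi = uni.identi
""").fetchone()[0]
```

With 4 rows the first count is 16 and the second is 4. The same shape with 28 million rows gives on the order of 28 million squared combinations, which is why the original statement does not finish.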

Executing query in chunks on Greenplum

I am trying to create a way to convert bulk date queries into incremental queries. For example, if a query has a where condition specified as
WHERE date > now()::date - interval '365 days' and date < now()::date
this will fetch a year's data if executed today. If the same query is executed tomorrow, 365 days of data will be fetched again. However, I already have the last 364 days of data from the previous run. I just want a single day's data to be fetched and a single day's data to be deleted from the system, so that I end up with 365 days of data with better performance. This data is to be stored in a separate temp table.
To achieve this, I create an incremental query, which will be executed in the next run. However, deleting a single date's data is proving tricky when the "date" column does not appear in the SELECT clause but does appear in the WHERE condition, because the temp table schema will then not have the "date" column.
So I thought of executing the bulk query in chunks and assign an ID to that chunk. This way, I can delete a chunk and add a chunk and other data remains unaffected.
Is there a way to achieve the same in Postgres or Greenplum, such as some built-in functionality? I went through the whole documentation but could not find any.
If not, is there a better solution to this problem?
I think this is best handled with something like an aggregates table (I assume the issue is that you have heavy aggregates to compute over a lot of data). This doesn't necessarily cause normalization problems (and data warehouses often denormalize anyway). The aggregates you need can be stored per day, so you cut down to one record per day of closed data, plus the non-closed data. Restricting the stored aggregates to data that can no longer change is what avoids the usual insert/update anomalies that normalization prevents.
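As a rough illustration of the per-day aggregates idea, the Python sketch below keeps one aggregate per closed day in a dict and, on each run, drops the day that left the window and computes only the days that are missing. The `roll_window` helper, the `compute_day` callback, and the 3-day window are assumptions made for the example; in the database the dict would be a table keyed by date.

```python
from datetime import date, timedelta

def roll_window(daily, today, compute_day, window_days=365):
    """Make `daily` ({date: aggregate}) cover the window_days days before today.

    Days outside the window are deleted; compute_day(d) is called only for
    days that are not stored yet. Returns the list of days that were computed.
    """
    first = today - timedelta(days=window_days)
    # Drop anything that has fallen out of the window.
    for d in list(daily):
        if d < first or d >= today:
            del daily[d]
    # Compute only the missing days.
    computed = []
    for i in range(window_days):
        d = first + timedelta(days=i)
        if d not in daily:
            daily[d] = compute_day(d)
            computed.append(d)
    return computed

# Hypothetical per-day aggregate: stands in for the heavy query on one date.
def compute_day(d):
    return d.day

daily = {}
first_run = roll_window(daily, date(2024, 1, 10), compute_day, window_days=3)
second_run = roll_window(daily, date(2024, 1, 11), compute_day, window_days=3)
total = sum(daily.values())
```

The first run computes every day in the window; the second run touches a single day. Because the stored rows are keyed by date, the original query's SELECT list does not need to expose the date column for old days to be removed.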