STL_SCAN data is not accurate over time in Redshift - amazon-redshift

To analyse which Redshift tables are used the most, we created a view, and we are exporting the table scan data from that view to Prometheus to see the trends over time.
var (
    RedshiftQueryTotalMetric = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Namespace: "redshift",
            Subsystem: "scan",
            Name:      "query_total",
            Help:      "total number of queries executed",
        },
        []string{"database", "schema", "tablename", "tableid"},
    )
)
Problem:
sum(redshift_scan_query_total{schema="test_schema",tablename="test_table"})
I have used a Gauge because the increments are not always by one but can be by several at a time, so I am effectively using the Gauge as a counter.
The total number of queries against a table over a 12-hour window should not drop. It should only ever increase, but it keeps dropping.
Questions:
Why does the total scan count derived from the STL_SCAN table drop every few hours?
How can I prevent it from dropping, at least for a day?

I suspect you are seeing Redshift purging older data from its system tables. Redshift only keeps system-table data for a few days, and when it clears out the older data your totals will drop. Have you looked at the data by start-time hour? I expect you will see that the older hours are removed periodically.
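One quick way to check this is to bucket the scans by hour; a sketch, assuming the standard STL_SCAN and SVV_TABLE_INFO columns:
-- If Redshift has purged older system-table rows, the oldest hours will
-- disappear from this result between runs.
SELECT DATE_TRUNC('hour', s.starttime) AS scan_hour,
       COUNT(DISTINCT s.query)         AS query_total
FROM stl_scan s
JOIN svv_table_info ti ON ti.table_id = s.tbl
WHERE ti."schema" = 'test_schema'
  AND ti."table"  = 'test_table'
GROUP BY 1
ORDER BY 1;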

Related

Keep table synced with another but with accumulated / grouped data

If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts        TIMESTAMPTZ(6) NOT NULL,
    value     FLOAT8 DEFAULT 'NaN'::FLOAT8 NOT NULL,
    sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often though I need to extract and work with daily values over years (On a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point and then I can use this for faster queries over longer timespans.
I don't want a trigger for every write to the DB, as I am afraid that would slow down what is already the bottleneck: writes to the DB.
Is there a way to trigger only after a certain number of rows have been inserted?
Or perhaps some kind of index that maintains a sum of entries per day? I don't think that is possible.
What would be the best way to do this? It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this?
Install ClickHouse and use the AggregatingMergeTree table engine.
With Postgres:
Create a per-period aggregate table. You can have several with different granularities, such as hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for the periods that came after the last available one. This process also works if the per-period table is empty, and if it missed a previous run it will simply catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamps of the rows that were aggregated, so that later, if you check the table, you can see that it used all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, which should help!
Then, repeat the same process for the "day" and "month" tables.
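A minimal sketch of the catch-up step for the hourly case, using the sensor_values table from the question (the sensor_values_hours table and its columns are assumed):
-- Assumed per-hour rollup table.
CREATE TABLE IF NOT EXISTS sensor_values_hours (
    hour      TIMESTAMPTZ NOT NULL,
    sensor_id INT4        NOT NULL,
    value_sum FLOAT8      NOT NULL,
    PRIMARY KEY (sensor_id, hour)
);
-- Run from cron shortly after each hour ends: aggregate every completed hour
-- newer than the last hour already present, so missed runs are caught up.
INSERT INTO sensor_values_hours (hour, sensor_id, value_sum)
SELECT date_trunc('hour', ts) AS hour, sensor_id, sum(value)
FROM sensor_values
WHERE ts >= COALESCE((SELECT max(hour) + interval '1 hour' FROM sensor_values_hours),
                     '-infinity')
  AND ts < date_trunc('hour', now())   -- only fully completed hours
GROUP BY 1, 2;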
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all the previous days' worth of data has already been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
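That union might look like this (a sketch; sensor_values_days is the per-day rollup the question asks for, with an assumed value_sum column):
-- Everything before today comes from the rollup table; today comes from
-- the live minute-level table.
SELECT day, sensor_id, value_sum
FROM sensor_values_days
WHERE day < date_trunc('day', now())
UNION ALL
SELECT date_trunc('day', ts) AS day, sensor_id, sum(value) AS value_sum
FROM sensor_values
WHERE ts >= date_trunc('day', now())
GROUP BY 1, 2;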
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized views and a cron job every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
In PG14, we will have INCREMENTAL MATERIALIZED VIEW, but for the moment it is still in development.
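Until then, a plain materialized view refreshed from cron works; a sketch, reusing the sensor_values_days name from the question:
-- Daily rollup as a materialized view. Each refresh recomputes the whole
-- view, so this is simple but not incremental.
CREATE MATERIALIZED VIEW sensor_values_days AS
SELECT date_trunc('day', ts) AS day, sensor_id, sum(value) AS value_sum
FROM sensor_values
GROUP BY 1, 2;
CREATE UNIQUE INDEX ON sensor_values_days (sensor_id, day);
-- Run this from cron every 5 minutes; CONCURRENTLY needs the unique index
-- and lets readers keep querying while the refresh runs.
REFRESH MATERIALIZED VIEW CONCURRENTLY sensor_values_days;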

Is there a technique with timescaledb to delete rows to reduce the frequency of older timescale data?

I'm storing a number of rows in a hypertable. The table size is growing quite large now even in its current test configuration.
I'd like to reduce the frequency of data from say once every 5 seconds to say once every 60 seconds for data older than a week by deleting a number of these older records.
Can anyone recommend an approach for doing so, or perhaps a better approach that better fits with timescaledb design?
So one of the next releases will have a built-in feature around data retention policies for continuous aggregations, so that you can define a continuous aggregation policy that rolls up per-second data into per-minute data, then drops the per-second data that is older than some time period.
(That capability doesn't exist today with continuous aggs, but will very shortly. Right now the best approach is either to have some cron job that deletes old data, or one that copies from one table to a second while aggregating, then calls drop_chunks on the first table.)
Ok, I've read 2 minutes of timescaledb documentation, so I'm an expert, right? Here's what I propose:
You already have a table (I'll call it the business table) and a hypertable with raw 5-second data in it
Create a second hypertable with the same columns as the first hypertable
Insert into the 2nd hypertable using a 60-second windowing function and the average, minimum, or maximum of your readings data (you have to decide which aggregation function is meaningful for your case). The insert SQL looks something like:
INSERT INTO minute_table (timestamp, my_reading)
SELECT time_bucket('60 seconds', time) AS the_minute, avg(my_raw_reading)
FROM five_second_table
WHERE time < (now() - interval '1 week')
GROUP BY the_minute;
Next, delete from the 5-second hypertable the rows whose timestamps fall within a minute bucket that is already present in the 60-second hypertable (a sketch of this step follows after these steps).
Finally, schedule something like this to run every week.
Sorry I'm not fluent in all the timescaledb functions but this should get you started on the 'heavy lift' of manually aggregating up from 5-second to 60-second samples.
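The delete step mentioned above could look roughly like this (a sketch, reusing the table and column names from the insert):
-- Remove raw 5-second rows older than a week whose minute bucket has
-- already been written to minute_table.
DELETE FROM five_second_table f
WHERE f.time < (now() - interval '1 week')
  AND EXISTS (
        SELECT 1
        FROM minute_table m
        WHERE m.timestamp = time_bucket('60 seconds', f.time)
      );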
Take a look at Data Retention.
For example:
SELECT drop_chunks(interval '24 hours', 'conditions');
This will drop all chunks from the hypertable 'conditions' that only include data older than this duration; it will not delete individual rows of data within chunks.

Redshift - How to update aging column when doing an incremental update of the table

I have a table that is updated incrementally once every day. The table has about 10 million records, but I update only the rows that were created or updated in the last 24 hours. This runs perfectly fine; however, I have a problem with one of the columns, which holds the aging of each sale (record), calculated from the time the record was created compared against the latest data-run time. I would like to know how I could have just this one column updated for all 10 million rows each time the table is updated incrementally.
I am using Amazon Redshift as the DB.
Thanks..
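One way to do this is to recompute the aging for every row in a separate statement after the incremental load finishes; a minimal sketch (the table and column names here are assumed, not taken from the question):
-- Recompute aging for all rows against the current run time, independently
-- of the incremental upsert.
UPDATE sales
SET aging_days = DATEDIFF(day, created_at, GETDATE());
Alternatively, since aging is derived from the creation time and the run time, it could be computed in a view at query time instead of being stored at all.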

Oracle Gather Statistics after Partition Exchange

Database: Oracle 12c
I have a process that selects data from a fact table, summarizes it, and pushes it to a summary table.
The summary table is range partitioned (on trade date) and list subpartitioned (on file id).
The process picks up data from the fact table (where file_id=<> for all trade dates), summarizes it in a temp table, and uses partition exchange to move the data from the temp table into one of the subpartitions of the summary table (as the process works at a file-id level).
The summary table is completely refreshed every day (100% of the data is exchanged).
Before the data is exchanged at the subpartition level, statistics are gathered and exchanged along with the data.
After the process is completed, we run dbms_stats.gather_table_stats at the partition level (in a for loop, once per partition) with granularity set to "APPROX_GLOBAL AND PARTITION".
Even though we collect stats at the global level, USER_TAB_STATISTICS shows STALE_STATS = 'YES' for the summary table; partition and subpartition stats, however, are available.
When we run a query against the summary table (for a date range of 3 years), the query spins for a long time, spiking the CPU to 90%, but never returns any data.
I checked the explain plan for the query; the cardinality shows as 1.
I read about incremental stats, but it seems incremental gathering only helps when a few partitions change; it may not be the best option in my case, where the data across all partitions changes completely.
I'm looking for a strategy to gather statistics on the summary table without running a full gather stats.
Thanks.
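For reference, the per-partition gather described above looks roughly like this (the owner, table, and partition names are placeholders):
-- Gather partition-level stats and derive approximate global stats without
-- scanning the whole table.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname     => 'APP_OWNER',
    tabname     => 'SUMMARY_TABLE',
    partname    => 'P_20200101',
    granularity => 'APPROX_GLOBAL AND PARTITION',
    cascade     => TRUE);
END;
/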

Postgresql Hstore and Toast Bloat

I was using hstore (PostgreSQL 9.3.4) to store, per day of the year, a count of how many times an event happened, with an update like the following.
days_count = days_count || hstore('x', (coalesce((days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the year. After running a simulation of expected production behavior, I ended up with a table that was 150MB + 2GB of TOAST + 25-30MB for the index, after ANALYZE and VACUUM.
I am now instead breaking up the above column into one for each month, like the following:
y_month_days_count = y_month_days_count || hstore('x', (coalesce((y_month_days_count -> 'x')::integer, 0) + 1)::text)
Where x is the day of the month, and y is the month of the year.
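For context, the full statement that expression is part of would look something like this (the table name, key column, and key value are assumed for illustration):
-- Increment the per-day counter for day 15 inside the given month's column.
UPDATE event_counts
SET    y_month_days_count = y_month_days_count
       || hstore('15', (coalesce((y_month_days_count -> '15')::integer, 0) + 1)::text)
WHERE  event_id = 42;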
I am still running the simulation right now, but so far, at a third of the way done, I am at 60MB + a pretty steady 20-30MB of TOAST + 25-30MB for the index, which means in the end I should end up with about 180MB + 30-40MB of TOAST + 25-30MB for the index after ANALYZE and VACUUM.
So first, are there any known issues with hstore and TOAST bloat that would explain my issue with my first setup?
Second, will my current solution of breaking up the columns cause any issues with hstore and performance in the future because of the number of hstore columns on one table? It seems to be steady now with row counts in the hundreds of thousands, and while I know more columns can make things slower, I am unsure whether this is worse with hstore columns.
Finally, I did find one thing out. I have one hstore column that represents each hour of the day, so it has 24 different keys. When I run the simulation for just this column I end up with almost no TOAST (in the KB range), but when I run the whole simulation, with the days broken up into per-month columns, my largest hstore has 52 keys.
So for a simple store of either a counter or a word or two, the maximum number of keys before I see any amount of TOAST for hstore is somewhere between 24 and 52 keys.
So first, are there any known issues with hstore and TOAST bloat that would explain my issue with my first setup?
Yes.
When you update any part of an out-of-line stored TOASTed field like text, hstore, or json, the whole field must be rewritten as a new row version. This is a consequence of MVCC: a copy of every version of the row that might still be visible to another transaction has to be retained.
The old one can be vacuumed away when it's no longer required by any running transaction, so in practice this has minimal impact so long as autovacuum is running aggressively enough.
So if you're updating lots of rows with big text, hstore, or json fields, or updating them frequently, tune autovacuum up so it runs more often and works faster. Make sure you don't have long-running <IDLE> in transaction connections.
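Per-table autovacuum settings are one way to do that without touching the global configuration; a sketch, with an assumed table name and illustrative thresholds:
-- Vacuum this table (and its TOAST table) once ~1% of rows are dead,
-- instead of the default 20%.
ALTER TABLE event_counts SET (
  autovacuum_vacuum_scale_factor       = 0.01,
  toast.autovacuum_vacuum_scale_factor = 0.01
);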
You say the table sizes you quoted were "after analyze and vacuum" but I'm guessing you only ran a regular vacuum, so the table bloat would've been freed for re-use by PostgreSQL but not released back to the OS. See if VACUUM FULL compacts it.
Will my current solution of breaking up the columns cause any type of issues with hstore and performance in the future because of the number of hstore columns on one table?
Depends on your query patterns and workload, but probably not.