I have a table to which values are written every few hours. However, the table is randomly dropped after about 1 to 2 hours (not consistently; sometimes after 65 minutes, sometimes after 90 minutes) and Grafana just shows ‘no data’. If I change the query to another table and try to change back, the table name has disappeared from the drop-down list.
I have my retention policy set as ‘autogen’, so the data should not be cleared out for a week.
If I have large amounts of data in a table defined like
CREATE TABLE sensor_values (
    ts        TIMESTAMPTZ(6) NOT NULL,
    value     FLOAT8 DEFAULT 'NaN'::FLOAT8 NOT NULL,
    sensor_id INT4 NOT NULL
);
Data comes in every minute for thousands of points. Quite often, though, I need to extract and work with daily values over years (on a web frontend). To aid this I would like a sensor_values_days table that only has the daily sums for each point, which I can then use for faster queries over longer timespans.
I don't want a trigger on every write to the db, as I am afraid that would slow down writes, which are already the bottleneck.
Is there a way to trigger only after so many rows have been inserted?
Or perhaps an index that maintains a sum of entries per day? I don't think that is possible.
What would be the best way to do this? It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Thanks
What would be the best way to do this.
Install ClickHouse and use the AggregatingMergeTree table engine.
With PostgreSQL:
Create per-period aggregate table. You can have several with different granularity, like hours, days, and months.
Have a cron or scheduled task run at the end of each period plus a few minutes. First, select the latest timestamp in the per-period table, so you know at which period to start. Then, aggregate all rows in the main table for periods that came after the last available one. This process will also work if the per-period table is empty, or if it missed the last update then it will catch up.
In order to do only inserts and no updates, you have to run it at the end of each period, to make sure it got all the data. You can also store the first and last timestamp of the rows that were aggregated, so later if you check the table you see it did use all the data from the period.
After aggregation, the "hour" table should be 60x smaller than the "minute" table, which should help!
Then, repeat the same process for the "day" and "month" table.
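The catch-up procedure described above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses Python with an in-memory SQLite database as a stand-in for PostgreSQL, timestamps are stored as ISO text, and the names `rollup_days`, `total`, `first_ts` and `last_ts` are hypothetical, not taken from the question.

```python
import sqlite3

def rollup_days(conn, today):
    """Insert daily sums for every complete day after the last summarized one."""
    # Latest day already in the per-day table (None if the table is empty).
    (last_day,) = conn.execute(
        "SELECT MAX(day) FROM sensor_values_days"
    ).fetchone()
    # Insert-only: aggregate days strictly after last_day and strictly before
    # today, so a period is summarized only once it is complete.
    conn.execute(
        """
        INSERT INTO sensor_values_days (day, sensor_id, total, first_ts, last_ts)
        SELECT date(ts), sensor_id, SUM(value), MIN(ts), MAX(ts)
        FROM sensor_values
        WHERE date(ts) < :today
          AND (:last_day IS NULL OR date(ts) > :last_day)
        GROUP BY date(ts), sensor_id
        """,
        {"today": today, "last_day": last_day},
    )
    conn.commit()

# Hypothetical demo data (SQLite stand-in for the PostgreSQL tables).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sensor_values (
        ts TEXT NOT NULL, value REAL NOT NULL, sensor_id INTEGER NOT NULL);
    CREATE TABLE sensor_values_days (
        day TEXT NOT NULL, sensor_id INTEGER NOT NULL, total REAL NOT NULL,
        first_ts TEXT NOT NULL, last_ts TEXT NOT NULL,
        PRIMARY KEY (day, sensor_id));
    """
)
conn.executemany(
    "INSERT INTO sensor_values VALUES (?, ?, ?)",
    [
        ("2024-01-01 00:00:00", 1.0, 1),
        ("2024-01-01 12:00:00", 2.0, 1),
        ("2024-01-02 06:00:00", 5.0, 1),
        ("2024-01-03 01:00:00", 7.0, 1),  # current, still incomplete day
    ],
)
rollup_days(conn, today="2024-01-03")  # summarizes Jan 1 and Jan 2 only
```

Because the starting point is read from the per-day table itself, running the job twice inserts nothing new, and a missed run is made up by the next one. In PostgreSQL the same idea would presumably use date_trunc on the timestamp column instead of SQLite's date().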
If you want up-to-date stats, you can UNION ALL the results of the "per day" table (for example) with the results of the live table, but only pull the current day out of the live table, since all the previous days' worth of data has already been summarized into the "per day" table. Hopefully, the current day's data will be cached in RAM.
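A minimal sketch of that UNION ALL query, again using Python with in-memory SQLite as a stand-in for PostgreSQL. The table layout, the `total` column and the `daily_totals` helper are assumptions made for illustration.

```python
import sqlite3

def daily_totals(conn, today):
    """Summarized past days plus the current day computed from the live table."""
    return conn.execute(
        """
        SELECT day, sensor_id, total
        FROM sensor_values_days
        UNION ALL
        SELECT date(ts) AS day, sensor_id, SUM(value) AS total
        FROM sensor_values
        WHERE ts >= :today          -- only the current day is read live
        GROUP BY date(ts), sensor_id
        ORDER BY day, sensor_id
        """,
        {"today": today},
    ).fetchall()

# Hypothetical demo data: two days already summarized, one day still live.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sensor_values (
        ts TEXT NOT NULL, value REAL NOT NULL, sensor_id INTEGER NOT NULL);
    CREATE TABLE sensor_values_days (
        day TEXT NOT NULL, sensor_id INTEGER NOT NULL, total REAL NOT NULL);
    INSERT INTO sensor_values_days VALUES ('2024-01-01', 1, 3.0);
    INSERT INTO sensor_values_days VALUES ('2024-01-02', 1, 5.0);
    INSERT INTO sensor_values VALUES ('2024-01-02 06:00:00', 5.0, 1);
    INSERT INTO sensor_values VALUES ('2024-01-03 01:00:00', 7.0, 1);
    INSERT INTO sensor_values VALUES ('2024-01-03 02:00:00', 1.5, 1);
    """
)
totals = daily_totals(conn, today="2024-01-03")
```

The range condition on ts keeps the live part of the query limited to the current day, so an index on the timestamp column can serve it without scanning older rows.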
It would not have to be very up to date. Losing the last few hours or a day would not be an issue.
Also if you want to partition your huge table, make sure you do it before its size becomes unmanageable...
Materialized views refreshed by a cron job every 5 minutes can help you:
https://wiki.postgresql.org/wiki/Incremental_View_Maintenance
Incremental materialized views were planned for PG14, but for the moment the feature is still in development.
I've created a dashboard that calculates from a table range that will expand every time the data is refreshed.
The dashboard tables are on a tab named dashboard. The table range is on a tab named service and has 27 columns of data. I count the number of services in the table range (every time it is refreshed) by service type (Column A). Now I need to limit that count by the Close Date column (Column P), where a date denotes a closed case, because I only want to count active cases. Active cases are those without a date, that is, null or blank. In essence I am counting nulls in a table range that expands daily. My formulas are not working (they often count all the null rows beyond the table range), and I need to exclude the header Close Date.
I've tried variations of this and using Countif:
=SUMPRODUCT(--('service'!A:A="chemical dependency")*('service'!P:P=0))
This one counts the header, but I don't have consistent results across all the options in the services.
Thank you.
Our web based app with 100,000 concurrent users has a use case where we auto-save the user's activity every 5 seconds. Consider a table like this:
create table essays
(
    id          uuid not null constraint essays_pkey primary key,
    userId      text not null,
    essayparts  jsonb default '{ }' :: jsonb,
    create_date timestamp with time zone default now() not null,
    modify_date timestamp with time zone default now() not null
);
create index essays_create_idx on essays ("create_date");
create index essays_modify_idx on essays ("modify_date");
This works well for us, as everything related to a user's essay, such as the title, brief byline, requestor, full essay body, etc., is stored in the essayparts column as JSON. For auto-saving the essay, we don't insert new rows all the time, though. We update each ID (each essay) with all its components.
So there are plenty of updates per essay, as this is a time consuming and thoughtful activity. Given the auto save every 5 seconds, if a user was to be writing for half an hour, we'd have updated her essay around 360 times.
This would be fine with the "HOT" (heap-only tuple) functionality of PostgreSQL; we're using v10, so it is available. However, the challenge is that we also update the modify_date column every time the essay is saved, and this column is indexed. Because a HOT update requires that no indexed column changes, these updates cannot be HOT, and a lot of fragmentation occurs.
I suppose in the web or mobile world this is not an unusual pattern. Many services seem to auto-save content. Are they insert only? If so, if the user logs out and comes back in, how do they show the records, by looking at the max(modify_date)? Or is there any other mechanism to leverage HOT updates while also updating an indexed column in the table?
Appreciate any pointers, thank you!
Performing an update every 5 seconds with 100,000 concurrent users will produce 20,000 updates per second. That is quite challenging in itself, and you would need a good system to pull it off; moreover, autovacuum will never be able to keep up if those updates are not HOT.
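As a quick check of these figures, here is a small arithmetic sketch in Python. The helper names are hypothetical, and the calculation assumes that every concurrent user is actively writing and saves are evenly spread over time.

```python
def updates_per_second(concurrent_users, interval_s):
    """Average update rate if every user auto-saves once per interval."""
    return concurrent_users / interval_s

def updates_per_essay(session_minutes, interval_s):
    """Number of auto-saves during one writing session."""
    return session_minutes * 60 // interval_s

# Figures from the question: 100,000 users, auto-save every 5 seconds.
rate_5s = updates_per_second(100_000, 5)     # 20000.0 updates per second
saves_30min = updates_per_essay(30, 5)       # 360 updates in half an hour

# Saving less often reduces the load proportionally.
rate_60s = updates_per_second(100_000, 60)   # roughly 1667 updates per second
```

The load scales inversely with the save interval, which is why saving less often is one of the simplest ways to reduce it.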
You have several options:
Choose a relational database management system other than PostgreSQL that updates rows in place.
Do not index modify_date and hope that HOT will do the trick.
Perform these updates way less often than once every 5 seconds (who needs auto-save every 5 seconds anyway?).
Auto-save the data somewhere else than in the database.
I have a table that is updated incrementally once a day. The table has about 10 million records, but I update only the rows that were created or updated in the last 24 hours. This runs perfectly fine. However, I have a problem with one column that holds the age of each sale (record), calculated as the difference between the time the record was created and the time of the latest data run. How can I have just this column updated for all 10 million rows each time the table is updated incrementally?
I am using Amazon Redshift as the DB.
Thanks..
I have a table in BigQuery with 35 million rows that I want to turn into a Tableau extract, but the number of rows is so great that Tableau can't add all of them at one time. The solution I have come up with is to first limit the number of rows presented by the BigQuery view to a particular date range for a full extract. After the extract is completed, I slowly increase the number of rows displayed by the view (by altering the WHERE clause to show only particular rows) and have Tableau perform an incremental extract based on the field containing the timestamp. That worked for a few of the incremental updates, but now I've run into a problem where Tableau Desktop says 'Required columns are not present in the remote data source. Perform full refresh of the extract', even though nothing in the data source or within Tableau has changed other than the WHERE clause that sets the range of dates the view presents. This was apparently a problem in Tableau 8, but I'm on Tableau 9.3. Does anyone have any suggestions?
I'm not sure if this will be a solution for everyone, but the problem that I was having seems to have been caused by making too many connections to Google BigQuery per hour.