Question regarding pg_stat_reset() in postgresql - postgresql

I have a question regarding function pg_stat_reset(); I am trying to collect database table stats on regular basis and for that purpose i am using stats from various postgres tables like pg_stat_all_tables, pg_stat_database. But data in this table is accumulated over period of time. I am planning to reset it on daily basis. I just want to know if its safe to run this function on daily, hourly basis on production database. And also if safe are there any settings PG offer by which i can reset stats of regular interval, apart from setting cronjob and running pg_stat_reset()
https://www.postgresql.org/docs/12/monitoring-stats.html

Don't do it, it will interfere with proper functioning of autovacuum. Rather than resetting counters to zero, just take snapshots of the counters and subtract one snapshot from a later one.

Related

PostgreSQL delete and aggregate data periodically

I'm developing a sensor monitoring application using Thingsboard CE and PostgreSQL.
Contex:
We collect data every second, such that we can have a real time view of the sensors measurements.
This however is very exhaustive on storage and does not constitute a requirement other than enabling real time monitoring. For example there is no need to check measurements made last week with such granularity (1 sec intervals), hence no need to keep such large volumes of data occupying resources. The average value for every 5 minutes would be perfectly fine when consulting the history for values from previous days.
Question:
This poses the question on how to delete existing rows from the database while aggregating the data being deleted and inserting a new row that would average the deleted data for a given interval. For example I would like to keep raw data (measurements every second) for the present day and aggregated data (average every 5 minutes) for the present month, etc.
What would be the best course of action to tackle this problem?
I checked to see if PostgreSQL had anything resembling this functionality but didn't find anything. My main ideia is to use a cron job to periodically perform the aggregations/deletions from raw data to aggregated data. Can anyone think of a better option? I very much welcome any suggestions and input.

How long are stats in PostgreSQL's pg_stat_all_indexes table stored?

I use PostgreSQL 11.1 and I'm trying to gather information from pg_stat_all_indexes table about indexes usage to determine whether a particular index can be removed or not - according to Index size/usage statistics section in https://wiki.postgresql.org/wiki/Index_Maintenance.
I noticed that information present there changes over time significantly (from couple of thousand index hits to 0) and I can't find information about these stats lifetime neither in the docs or this table itself. That worries me because I want to be sure, that the decision I'm going to make about index removal was based on stats from long enough period of time.
Is there any way to check how long these statistics are stored in the pg_stat_all_indexes table?
Quote from the manual
When the server shuts down cleanly, a permanent copy of the statistics data is stored in the pg_stat subdirectory, so that statistics can be retained across server restarts. When recovery is performed at server start [...], all statistics counters are reset
(emphasis mine)
If you want to reset the statistics to get a baseline, you can use pg_stat_reset() and then compare the values from that point on (e.g. by storing daily snapshots and comparing the differences between days)

What are some strategies to efficiently store a lot of data (millions of rows) in Postgres?

I host a popular website and want to store certain user events to analyze later. Things like: clicked on item, added to cart, removed from cart, etc. I imagine about 5,000,000+ new events would be coming in every day.
My basic idea is to take the event, and store it in a row in Postgres along with a unique user id.
What are some strategies to handle this much data? I can't imagine one giant table is realistic. I've had a couple people recommend things like: dumping the tables into Amazon Redshift at the end of every day, Snowflake, Google BigQuery, Hadoop.
What would you do?
I would partition the table, and as soon as you don't need the detailed data in the live system, detach a partition and export it to an archive and/or aggregate it and put the results into a data warehouse for analyses.
We have similar use case with PostgreSQL 10 and 11. We collect different metrics from customers' websites.
We have several partitioned tables for different data and together we collect per day more then 300 millions rows, i.e. 50-80 GB data daily. In some special days even 2x-3x more.
Collecting database keeps data for current and last day (because especially around midnight there can be big mess with timestamps from different part of the world).
On previous versions PG 9.x we transferred data 1x per day to our main PostgreSQL Warehouse DB (currently 20+ TB). Now we implemented logical replication from collecting database into Warehouse because sync of whole partitions was lately really heavy and long.
Beside of it we daily copy new data to Bigquery for really heavy analytical processing which would on PostgreSQL take like 24+ hours (real life results - trust me). On BQ we get results in minutes but pay sometimes a lot for it...
So daily partitions are reasonable segmentation. Especially with logical replication you do not need to worry. From our experiences I would recommend to not do any exports to BQ etc. from collecting database. Only from Warehouse.

Run vacuum by schedule

I'm using Postgres version 9.6
Most of my tables are for queries, update, insert.
Most of them around 200K-700K.
There are bigger (millions) and smaller.
Is that a good idea to perform vacuum (and analyze?) operation once a day? once a week? regardless if there is an autovacuum..
Advantages vs disadvantages?
Autovacuum is done when needed and it only creates statistics that are used when planning a query.
Basically you never need to do this manually, unless you have made vast changes to a table (filled it with data for example), and want to use it in another query within a few milliseconds. In that scenario, old statistics will result in the query planner coming up with a very bad query plan and will lead to a significantly slower query.
What you might want to do once per day / per week, or whatever, is to cluster tables, recreate degraded indexes, on tables that were modified a lot. Research these topics more to decide if / when / how to do it.

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RAN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so if only a few users have actually changed ranks, we still need to update the rest of the users ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush and slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the i/o) help, as data loss isn't something I'm worried about as the database is updated frequently regardless?
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis, Redis have something called a "Sorted List" designed just for this job, with it you can have 100 millions entries and still fetch the 5000000th to 5001000th at sub millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can setup a replicaset and send your read queries there. This should help you a fair bit, but not as much as the disks.