I keep getting emails that state the following:
"[Your database X] contains 16,919 rows, exceeding the plan limit of 10,000. INSERT privileges to the database will be automatically revoked in 7 days. This will cause service failures in most applications dependent on this database."
Even though I have limited the number of rows in my single-table application to a maximum of 10,000, usually hovering around 9,999.
I have checked the number of rows and the number of tables with both psql and PgAdmin3.
Any idea how Heroku counts the number of rows in a database? Is this a platform bug or am I missing something?
Right now it makes estimated counts until you reach a certain threshold, at which point it performs accurate counts (this mechanism is subject to change). It will never revoke access or email a user without doing an accurate count first (SELECT count(*) FROM table1 + SELECT count(*) FROM table2, etc.).
It does not count system tables; it considers all user-level tables. Oftentimes people don't realize they have tables that are eating up rows, such as sessions, events, or logs.
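If you want to check this yourself, a quick sketch along those lines (the sessions table name is just an example of a hidden row consumer):
-- planner's estimate of live rows per user table; cheap but approximate
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
-- exact count for one suspect table (can be slow on big tables)
SELECT count(*) FROM sessions;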
Using: Postgres 10.2
I have the table animals with these columns (omitting some unrelated ones):
animalid (number, primary key)
location (text)
type (text)
name (text)
data (jsonb), e.g. {"age": 2, "tagid": 11}
Important points:
This table is partitioned into 1,000 child tables.
Each child table holds around 100,000 records, hence a total of ~100 million records.
My application tries to fetch an animal based on the animalid. For example:
select "animals"."animalid", "animals"."type", "animals"."data", "animals"."location", "animals"."name"
from "animals"
where "animals"."animalid" = 2241
This, though, throws the following error when there are many requests:
ERROR: out of shared memory
Hint: You might need to increase max_locks_per_transaction.
I would have thought that SELECT queries shouldn't be affected by locks on these tables. Or could it be that queries outside this application fill up the lock memory because of the locks acquired on the partitioned tables, thus also affecting the SELECT queries?
I have the option of querying the relevant child table directly (there is logic to determine which one). Could this help fix the issue?
Is it generally a good idea to use a bigger value for the max_locks_per_transaction setting if there are partitioned tables and queries that update them?
My main area of concern is that I do not quite understand why a SELECT query is being affected here. Can anyone help explain?
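For reference, a minimal sketch (reusing the animals table above) of how to count the relation-level locks a single SELECT on the partitioned parent holds; locks acquired while the query is planned are kept until the transaction ends, so with 1,000 child tables the number can be much larger than expected:
BEGIN;
SELECT * FROM animals WHERE animalid = 2241;
-- relation-level locks currently held by this session
SELECT count(*)
FROM pg_locks
WHERE pid = pg_backend_pid() AND locktype = 'relation';
ROLLBACK;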
We have a table with nearly 2 billion events recorded. As per our data model, each event is uniquely identified by a 4-column composite primary key. Excluding the primary key, there are 5 B-tree indexes, each on a single different column, so 6 B-tree indexes in total.
The events recorded span several years, and now we need to remove the data older than 1 year.
We have a time column with long values recorded for each event, and we use the following query:
delete from events where ctid = any ( array (select ctid from events where time < 1517423400000 limit 10000) )
Do the indexes get updated?
During testing, they didn't.
After insertion (sizes in bytes):
total_table_size - 27893760
table_size - 7659520
index_size - 20209664
After deletion:
total_table_size - 20226048
table_size - 0
index_size - 20209664
A REINDEX can be done:
Command: REINDEX
Description: rebuild indexes
Syntax:
REINDEX { INDEX | TABLE | DATABASE | SYSTEM } name [ FORCE ]
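Applied to the events table above, a rebuild would look something like this (the index name is hypothetical):
REINDEX TABLE events;
-- or rebuild just one index
REINDEX INDEX events_time_idx;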
We consider @a_horse_with_no_name's method to be the right solution.
What we had:
Postgres version 9.4.
1 table with 2 billion rows, 21 columns (all bigint), a 5-column composite primary key, and 5 single-column indexes, with data spanning 2 years.
It looks similar to time-series data, with a time column containing a UNIX timestamp, except that it is an analytics project, so the time values do not increase in order. The table was insert- and select-only (most SELECT queries use aggregate functions).
What we needed: our data span is 6 months, so we need to remove the older data.
What we did (with little knowledge of Postgres internals):
Delete rows in batches of 10,000.
Initially the deletes were fast, taking milliseconds, but as the bloat increased each batch delete slowed to nearly 10 seconds. Then autovacuum got triggered and ran for almost 3 months. The insert rate was high, and each batch delete increased the WAL size too. Poor statistics on the table made the current queries so slow that they ran for minutes and hours.
So we decided to go for partitioning. We implemented it using table inheritance in 9.4.
Note: Postgres has declarative partitioning from version 10, which handles most of the manual work needed when partitioning via table inheritance.
Please go through the official docs, as they have a clear explanation.
Simplified, this is how we implemented it (a condensed SQL sketch follows the list):
Create parent table
Create child tables inheriting from it, with CHECK constraints. (We had monthly partitions, created using a scheduler.)
Indexes need to be created separately for each child table.
To drop old data, just drop the child table, so no vacuum is needed and it is instant.
Make sure the Postgres setting constraint_exclusion is set to partition.
VACUUM ANALYZE the old partition after you start inserting into the new partition. (In our case, it helped the query planner use index-only scans instead of sequential scans.)
Using triggers as mentioned in the docs may make inserts slower, so we deviated from that: since we partitioned on the time column, we calculated the table name at the application level from the time value before every insert, and it did not affect the insert rate for us.
Also read other caveats mentioned there.
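A condensed SQL sketch of the steps above (the table name, column list, and millisecond boundaries are illustrative, not our exact schema):
-- parent table (holds no data itself)
CREATE TABLE events (
    id    bigint,
    time  bigint,
    value bigint
);
-- one child table per month, with a CHECK constraint the planner can use
CREATE TABLE events_2018_01 (
    CHECK (time >= 1514764800000 AND time < 1517443200000)
) INHERITS (events);
-- indexes have to be created on each child table separately
CREATE INDEX ON events_2018_01 (time);
-- let the planner skip children whose CHECK constraint rules them out
SET constraint_exclusion = partition;
-- dropping old data is instant: no DELETE and no VACUUM needed
DROP TABLE events_2018_01;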
I'm retrieving data from an AWS database using PgAdmin. This works well. The problem is that I have one column that I set to TRUE after I retrieve the corresponding row, where originally it is NULL. Doing so adds an enormous amount of data to my database.
I have checked that this is not due to other processes: it only happens when my program is running.
I am certain no rows are being added; I have checked the number of rows before and after and they're the same.
Furthermore, it only does this when changing specific tables; when I update other tables in the same database with the same process, the database size stays the same. It also does not always increase the database size: only once every couple of changes does the total size increase.
How can changing a single boolean from NULL to TRUE add 0.1 MB to my database?
I'm using the following commands to check my database makeup:
To get table sizes
SELECT
relname AS "Table",
pg_total_relation_size(relid) AS "Size",
pg_size_pretty(pg_total_relation_size(relid) - pg_relation_size(relid)) AS "External Size"
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;
To get number of rows:
SELECT schemaname,relname,n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC;
To get database size:
SELECT pg_database_size('mydatabasename');
If you have not changed it, then the fillfactor on your table is 100%, since that is the default.
This means that every update to your table marks the old row version as obsolete and writes a new version of the updated row. The issue can be even worse if you have indexes on the table, since those may have to be updated on every row change too. As you can imagine, this hurts UPDATE performance as well.
So, technically, if you read the whole table and update even the smallest column of every row you read, you can roughly double the table size when your fillfactor is 100.
What you can do is ALTER your table to lower the fillfactor on it, and then VACUUM it:
ALTER TABLE your_table SET (fillfactor = 90);
VACUUM FULL your_table;
Of course, with this step your table will be about 10% bigger, but Postgres will keep some free space in each page for your updates, and the table won't keep growing with your process.
The reason autovacuum helps is that it cleans up the obsoleted rows periodically and therefore keeps your table at roughly the same size, but it puts a lot of pressure on your database. If you know you'll be doing operations like the one described in the question, I would recommend tuning the fillfactor for your needs.
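If you want to watch this happening, a rough way to see how many obsoleted (dead) row versions a table has accumulated is the statistics view below (your_table is the same placeholder as above):
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'your_table';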
The problem is that (source):
"In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table"
Furthermore, we did not always close the cursor, which also increased the database size while the process was running.
One last problem is that we were running one huge query, not allowing the system to autovacuum properly. This problem is described in more detail here.
Our solution was to re-approach the problem so that the rows did not have to be updated. Another solution that we could think of, but have not tried, is to stop the process every once in a while to allow autovacuum to work correctly.
What do you mean by "adds data"? To all the data files? Specifically to some files?
To get a precise answer you should supply more details, but generally speaking, any DB operation will add data to the transaction logs, and possibly to other files.
Does anyone know why the order of the rows changed after I made an update to the table? Is there any way to make the order go back, or change it to another order, e.g. alphabetical order?
This is the update I performed:
update t set amount = amount + 1 where account = accountNumber
After this update, when I go and look at the table, the order has changed.
A table doesn't have a natural row order; some database systems will actually refuse your query if you don't add an ORDER BY clause at the end of your SELECT.
Why did the order change?
Because the database engine fetches your rows in the physical order in which they come from storage. In PostgreSQL, an UPDATE writes a new version of the row, often in a different physical location, which is why updated rows appear to move. Some engines, like SQL Server, can have a CLUSTERED INDEX which forces a physical order, but even then you are never really guaranteed to get your results in that precise order.
The clustered index exists mostly as an optimization. PostgreSQL has a similar CLUSTER command to change the physical order, but it's a heavy process which locks the table: http://www.postgresql.org/docs/9.1/static/sql-cluster.html
How to force an alphabetical order of the rows?
Add an ORDER BY clause in your query.
SELECT * FROM table ORDER BY column
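If you want to see the effect directly, here is a small demonstration with illustrative names (ctid exposes a row's current physical location):
CREATE TABLE accounts (account int, amount int);
INSERT INTO accounts VALUES (1, 10), (2, 20), (3, 30);
SELECT ctid, * FROM accounts;              -- rows typically come back in insertion order
UPDATE accounts SET amount = amount + 1 WHERE account = 1;
SELECT ctid, * FROM accounts;              -- the updated row now tends to show up last
SELECT * FROM accounts ORDER BY account;   -- deterministic order, regardless of storage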
PostgreSQL stores statistics about tables in the system table called pg_class. The query planner accesses this table for every query. These statistics are only updated by maintenance commands such as ANALYZE. If ANALYZE is not run often, the statistics in this table may not be accurate, and the query planner may make poor decisions which can degrade system performance. Another strategy would be for the query planner to generate these statistics for each query (including selects, inserts, updates, and deletes). This approach would allow the query planner to have the most up-to-date statistics possible.
Why does Postgres always rely on pg_class instead?
pg_class doesn't contain all the statistics needed by the planner; it only contains basic information about the table's structure and size. The statistics generated by the ANALYZE command (stored in pg_statistic) contain information about the values existing in each column, so when executing a command like:
SELECT * FROM tab WHERE cname = 'pg';
the planner knows how many rows are in the table and how many rows have the value 'pg' in the column cname. This information does not exist in pg_class.
Another nice feature of PostgreSQL is autovacuum; in 99.9999% of cases it should be enabled, so that the database refreshes statistics as soon as a certain number of rows (configurable in the config file) have changed. That minimizes the chance of a wrong execution plan caused by stale table statistics.
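To see both kinds of statistics side by side (the table and column names here are just examples):
-- rough row and page estimates kept in pg_class
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'tab';
-- per-column statistics gathered by ANALYZE, exposed through the pg_stats view
SELECT attname, n_distinct, most_common_vals
FROM pg_stats
WHERE tablename = 'tab';
-- refresh both manually
ANALYZE tab;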