Redshift query results incorrect before VACUUM - amazon-redshift

When Redshift uses an index to run a query (say counts), does it exclude counting the rows in the unsorted region?
I had copied a lot of data using the COPY command, but did not VACUUM the table afterwards. On running my query (involving joins with several tables), the results were wrong - the newly copied rows in the unsorted region weren't counted.
Then, after vacuuming the table, the query began returning the correct results. Is this expected behavior, or is this a bug introduced by Amazon?

Vacuuming won't have any effect on COPYed rows, which are effectively inserts. Vacuum physically removes rows previously deleted with a SQL DELETE statement; DELETE only marks the rows as deleted, so they don't participate in subsequent queries but still consume disk space.
Redshift is an eventually consistent database, so even if your COPY command has completed, the rows may not yet be visible to queries.
Running vacuum is basically a defrag, which requires all rows to be reorganized. This (probably) causes the table to be brought into a consistent state, i.e. all rows become visible to queries.
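
For illustration, a minimal sketch of the load-then-maintain pattern discussed here, assuming a hypothetical sales table, S3 prefix, and IAM role:

COPY sales
FROM 's3://my-bucket/sales/'                                 -- hypothetical bucket/prefix
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'   -- placeholder role ARN
FORMAT AS CSV;

-- Re-sort the unsorted region and refresh planner statistics after the bulk load
VACUUM sales;
ANALYZE sales;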

Related

How can I lock statistics (analyze) of tables in PostgreSQL?

There is a scheduled job that truncates and inserts data into a table (size more than 1 GB).
Truncating the table causes the statistics to change. Then, a query using this table as a source gets an undesired execution plan and takes too much time.
If I analyze the table manually, the duration of the query decreases to an acceptable time.
You cannot do that.
If the scheduled job performs mass data modification, it had better run an explicit ANALYZE when it is done (and a VACUUM wouldn't hurt either).
It does not make much sense to keep the statistics when you truncate a table and insert new data into it.
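
As a sketch of that suggestion, the scheduled job could simply end with an explicit ANALYZE (the table names here are hypothetical):

TRUNCATE report_data;

INSERT INTO report_data
SELECT * FROM source_data;   -- whatever the job normally loads

-- Refresh planner statistics right away instead of waiting for autoanalyze
ANALYZE report_data;

-- Optional, as noted above
VACUUM report_data;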

Need to drop 900+ postgres schemas but it wants me to vacuum first

I have 900+ postgres schemas (which collectively hold 40,000 tables) that I'd like to drop. However, it appears that it wants me to vacuum everything first, because I get this whenever I try to drop a schema.
ERROR: database is not accepting commands to avoid wraparound data loss in database
Is there a way to drop a large number of schemas without having to vacuum first?
Is there any problem in running the vacuum command? It is like garbage collection for a database. I use a Postgres database, and I run this command before doing any major work like a backup or creating SQL scripts of the whole database.
VACUUM reclaims storage occupied by dead tuples. In normal PostgreSQL operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain present until a VACUUM is done. Therefore it's necessary to do VACUUM periodically, especially on frequently-updated tables.
You've got two choices. Do the vacuum, or drop the whole database. xid wrap-around must be avoided.
https://blog.sentry.io/2015/07/23/transaction-id-wraparound-in-postgres
There is not much you can do, except VACUUM or dropping the database.
In addition, if you don't do the VACUUM, the database will not work for anything, not just for the schemas you want to drop.
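
A rough sketch of those two options (schema and database names are hypothetical; older PostgreSQL versions may require running the VACUUM in single-user mode, as the error's HINT usually says):

-- Option 1: vacuum first, then drop the schemas.
-- VACUUM FREEZE aggressively freezes old tuples and pushes back the wraparound limit.
VACUUM FREEZE;
DROP SCHEMA schema_0001 CASCADE;   -- CASCADE also drops the contained tables; repeat (or generate) per schema

-- Option 2: if nothing in the database needs to be kept, drop it outright.
DROP DATABASE mydb;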

PostgreSQL vacuuming a frequently updating jsonb field

I have a Postgres table with a jsonb field. The field is about 2-4 KB per row. My application updates 100k rows per day 2000 times (changing 0.1-0.5% of the data in the field). Autovacuum is off; VACUUM FULL runs every day at night.
The vacuum frees about 100-300 GB every day and takes a long time to run, causing application downtime.
The question is: can I solve this problem while keeping the jsonb field, or must I split that field out into other, simpler tables?
If your concern is long downtime, then yes: VACUUM FULL holds an exclusive lock on the table being vacuumed for the entire run.
I'd suggest trying the pg_repack or pg_squeeze extension, depending on your Postgres version. Unlike CLUSTER and VACUUM FULL, they work online, without holding an exclusive lock on the processed tables during processing. These extensions are easy to install and use, can reduce your downtime significantly, and will also help reduce the need for VACUUM FULL runs.
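
A hedged sketch of the pg_repack route (the table and database names below are hypothetical):

-- Inside the target database: install the extension once
CREATE EXTENSION pg_repack;

The rewrite itself is then driven by the bundled command-line client, along the lines of pg_repack --table=my_jsonb_table mydb, which rebuilds the table online while taking only brief exclusive locks at the start and end.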

How does postgresql lock tables when inserting and selecting?

I'm migrating data from one table to another in an environment where long locks or downtime are not acceptable - about 80,000 rows in total. Essentially the query boils down to this simple case:
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 on table_1.id = table_3.id
All 3 tables are being read from and could have an insert at any time. I want to just run the query above, but I'm not sure how the locking works and whether the tables will be totally inaccessible during the operation. My understanding tells me that only the affected rows (newly inserted) will be locked. Table 1 is just being selected, so no harm, and concurrent inserts are safe so table 2 should be freely accessible.
Is this understanding correct, and can I run this query in a production environment without fear? If it's not safe, what is the standard way to accomplish this?
You're fine.
If you're interested in the details, you can read up on multiversion concurrency control, or on the details of the Postgres MVCC implementation, or how its various locking modes interact, but the implications for your case are nicely summarised in the docs:
reading never blocks writing and writing never blocks reading
In short, every record stored in the database has some version number attached to it, and every query knows which versions to consider and which to ignore.
This means that an INSERT can safely write to a table without locking it, as any concurrent queries will simply ignore the new rows until the inserting transaction decides to commit.
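
To make that concrete, here is a minimal two-session sketch reusing the question's tables (column layouts are whatever your schema defines):

-- Session 1: run the migration inside a transaction
BEGIN;
INSERT INTO table_2
SELECT * FROM table_1
JOIN table_3 ON table_1.id = table_3.id;
-- the new rows exist on disk, but are visible only to this transaction

-- Session 2, at the same time: neither statement waits on session 1
SELECT count(*) FROM table_2;           -- still sees only the old rows
SELECT * FROM table_1 WHERE id = 42;    -- plain reads are never blocked

-- Session 1: after commit, the new rows become visible to everyone
COMMIT;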

postgresql 9.2 never vacuumed and analyze

I have been given a Postgres 9.2 DB of around 20 GB.
I looked through the database and saw that vacuum and/or analyze have never been run on any tables.
Autovacuum is on, and the transaction wraparound limit is still very far away (only 1% of it used).
I know nothing about the data activity (number of deletes, inserts, updates), but I see it uses a lot of indexes and sequences.
My question is:
does the lack of vacuum and/or analyze affect data integrity (for example, a SELECT not showing all the rows that match, whether read from the table or from an index)? The speed of queries and writes doesn't matter.
is it possible that after the vacuum and/or analyze, the same query gives a different answer than it would have given before the vacuum/analyze command?
I'm fairly new to PG, thank you for your help!!
Regards,
Figaro88
Running vacuum and/or analyze will not change the result set produced by any select operation (unless there was a bug in PostgreSQL). They may affect the order of results if you do not supply an ORDER BY clause.
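
For completeness, a small sketch of how the "never vacuumed" observation can be checked: pg_stat_user_tables records the last manual and automatic runs per table (NULL means never, or that the statistics counters were reset):

SELECT relname,
       last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname;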