Why is my PostgreSQL table larger (in GB) than the csv it came from? - postgresql

A < 4 GB csv became a 7.7 GB table in my AWS Postgres instance. And a 14 GB csv wouldn't load into 22 GB of space, I'm guessing because it is also going to double in size! Is this factor of two normal? And if so, why, and is it reliable?

There are many possible reasons:
Indexes take up space. If you have lots of indexes, especially multi-column indexes or GiST / GIN indexes, they can be big space hogs.
Some data types are represented more compactly in text form than in a table. For example, 1 consumes 1 byte in csv (or 2 if you count the comma delimiter) but if you store it in a bigint column it requires 8 bytes.
If there's a FILLFACTOR set, PostgreSQL will intentionally waste space to make later UPDATEs and INSERTs faster. If you don't know what FILLFACTOR is, then there isn't one set.
PostgreSQL has a much larger per-row overhead than CSV. In CSV, the per-row overhead is 2 bytes for a carriage return and newline. Rows in a PostgreSQL table require 24 to 28 bytes, plus data values, mainly because of the metadata required for multiversion concurrency control. So a CSV with very many narrow rows will produce a significantly bigger table than a CSV of the same size in bytes that has fewer, wider rows.
PostgreSQL can do out-of-line storage and compression of values using TOAST. This can make big text strings significantly smaller in the database than in CSV.
You can use octet_length and pg_column_size to get PostgreSQL to tell you how big individual values and rows are. Because of TOAST out-of-line compressed storage, pg_column_size might return a different result for a tuple produced by a VALUES expression than for one SELECTed from a table.
You can also use pg_total_relation_size to find out how big the table, including its indexes and TOAST data, ends up for a given sample input.
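As a minimal sketch of those checks (big_table and payload are hypothetical names for a table and a text column):
-- per-value size as stored (after any TOAST compression) vs. raw text length
SELECT pg_column_size(payload) AS stored_bytes,
       octet_length(payload)   AS text_bytes
FROM   big_table
LIMIT  10;
-- total on-disk size of the table, including its indexes and TOAST data
SELECT pg_size_pretty(pg_total_relation_size('big_table'));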

Related

Postgresql - How do I find compressed fields in a table?

Postgresql uses different storage techniques for field values. If values become large (e.g. are long texts), Postgresql will eventually apply compression and/or TOAST the values.
How do I find out which fields in a table are compressed?
(Background: I have a database that stores tiny BLOBs in some columns and I want to find out how much of it is compressed - if compression is hardly used, I want to turn it off, so Postgres won't waste CPU cycles on trying)
Starting in v14, there is the function pg_column_compression:
SELECT pg_column_compression(colname), count(*) FROM tablename GROUP BY 1;
A NULL result means the value is not stored compressed (for example, because it was small enough to be kept inline uncompressed), so if nearly every row comes back NULL, compression is hardly being used.

How to decrease size of a large postgresql table

I have a postgresql table that is "frozen", i.e. no new data is coming into it. The table is strictly used for reading purposes. The table contains about 17M records. The table has 130 columns and can be queried multiple different ways. To make the queries faster, I created indices for all combinations of filters that can be used. So I have a total of about 265 indexes on the table. Each index is about 1.1 GB. This brings the total size to around 265 GB. I have vacuumed the table as well.
Question
Is there a way to further bring down the disk usage of this table?
Is there a better way to handle queries for "frozen" tables that never get any data entered into them?
If your table or indexes are bloated, then VACUUM FULL tablename could shrink them. But if they aren't bloated, this won't do any good. It is not a benign operation: it will lock the table for a period of time (it needs to rebuild hundreds of indexes, so probably a long period of time) and generate large amounts of IO and WAL, the latter of which will be especially troublesome for replicas. So I would test it on a non-production clone to see whether it actually shrinks things and roughly how long a maintenance window you will need to declare.
Other than that, be more judicious in your choice of indexes. How did you get the list of "all combinations for filters that can be used"? Was it by inspecting your source code, or just by tackling slow queries one by one until you ran out of slow queries? Maybe you can look at snapshots of pg_stat_user_indexes taken a few days apart to see if all of them are actually being used.
Are these mostly two-column indexes?
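As a rough sketch of that pg_stat_user_indexes check (tablename stands in for the real table), something like the following lists each index with its scan count and size; indexes whose idx_scan stays at zero across snapshots are candidates for dropping:
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname = 'tablename'
ORDER  BY idx_scan;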

Will gzipping + storage as bytea save more disk space over storage as text?

I have a table containing 30 million rows, and one of the columns is currently a text column. The column is populated with random strings between 2 and 10 kB in size. I don't need to search the strings directly.
I'm considering gzipping the strings before saving them (typically reducing their size by about 2x) and saving them in a bytea column instead.
I have read that Postgresql does some compression of text columns by default, so I'm wondering: Will there be any actual disk space reduction as a product of the suggested change?
I'm running Postgresql 9.3
PostgreSQL stores text columns that exceed about 2000 bytes in a TOAST table and compresses the data.
The compression is fast, but not very good, so you can have some savings if you use a different compression method. Since the stored values are not very large, the savings will probably be small.
If you want to go that way, you should disable compression on that already compressed column:
ALTER TABLE tab
ALTER bin_col SET STORAGE EXTERNAL;
I'd recommend that you go with PostgreSQL's standard compression and keep things simple, but the best thing would be for you to run a test and see if you get a benefit from using custom compression.
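If you do run such a test, one rough way to see how much the built-in TOAST compression is already saving on the existing text column is to compare the stored size with the raw length (documents and doc are placeholder names for your table and column):
SELECT pg_size_pretty(avg(pg_column_size(doc))::bigint) AS avg_stored,
       pg_size_pretty(avg(octet_length(doc))::bigint)   AS avg_raw
FROM   documents;
If the stored size is already much smaller than the raw length, TOAST compression is doing its job and gzipping will gain you less.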

How much data can practically be stored in a Postgres table field?

I have an application that stores data in a Postgres database that I am considering extending. This would entail storing larger amounts of data in fields of a table. How large can the data in a field reasonably be before one starts running into performance issues? 100 kB? 1 MB? 100 MB? 500 MB? Does it matter what type the data is stored as (other than the fact that binary data tends to be more compact)?
Up to 1 GB per field, 32 TB per relation.
Upper limits were originally listed on the "about" page of Postgres; they have since been moved to the manual page "PostgreSQL Limits".
But storing massive amounts of data in table columns is typically a bad idea. If you want to change anything in a 1 GB field, Postgres has to write a new row version, which is extremely inefficient. That's not the use case relational databases are optimized for.
Consider storing large objects in files or at least use binary large objects for this. See:
Storing long binary (raw data) strings
I would think twice before even storing megabytes of data into a single table field. You can do it. Doesn't mean you should.

Maximum (usable) number of rows in a Postgresql table

I realize that, per Pg docs (http://www.postgresql.org/about/), one can store an unlimited number of rows in a table. However, what is the "rule of thumb" for usable number of rows, if any?
Background: I want to store daily readings for a couple of decades for 13 million cells. That works out to 13 M * (366|365) * 20 ~ 9.5e10, or 95 B rows (in reality, around 120 B rows).
So, using table partitioning, I set up a master table, and then inherited tables by year. That divvies up the rows to ~ 5.2 B rows per table.
Each row is 9 SMALLINTs and two INTs, so 26 bytes. Add to that the Pg overhead of 23 bytes per row, and we get 49 bytes per row. So each table, without any PK or any other index, will weigh in at ~ 0.25 TB.
For starters, I have created only a subset of the above data, that is, only for about 250,000 cells. I have to do a bunch of tuning (create proper indexes, etc.), but the performance is really terrible right now. Besides, every time I need to add more data, I will have to drop the keys and then recreate them. The saving grace is that once everything is loaded, it will be a read-only database.
Any suggestions? Any other strategy for partitioning?
It's not just "a bunch of tuning (indexes etc.)". This is crucial and a must-do.
You posted few details, but let's try.
The rule is: Try to find the most common working set. See if it fits in RAM. Optimize hardware, PG/OS buffer settings and PG indexes/clustering for it. Otherwise look for aggregates, or, if that's not acceptable and you need fully random access, think about what hardware could scan the whole table for you in a reasonable time.
How large is your table (in gigabytes)? How does it compare to total RAM? What are your PG settings, including shared_buffers and effective_cache_size? Is this a dedicated server? If you have a 250-gig table and about 10 GB of RAM, it means you can only fit 4% of the table.
Are there any columns which are commonly used for filtering, such as state or date? Can you identify the working set that is most commonly used (like only last month)? If so, consider partitioning or clustering on these columns, and definitely index them. Basically, you're trying to make sure that as much of the working set as possible fits in RAM.
Avoid scanning the table at all costs if it does not fit in RAM. If you really need absolutely random access, the only way it could be usable is with really sophisticated hardware. You would need a persistent-storage/RAM configuration that can read 250 GB in a reasonable time.
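As a minimal sketch of the kind of yearly partitioning plus indexing on the common filter columns suggested above (this uses modern declarative partitioning rather than the inheritance setup described in the question, and all table and column names are made up):
-- parent table partitioned by reading date, one partition per year
CREATE TABLE readings (
    cell_id    integer NOT NULL,
    reading_on date    NOT NULL,
    v1 smallint, v2 smallint, v3 smallint  -- ... remaining measurement columns
) PARTITION BY RANGE (reading_on);
CREATE TABLE readings_2010 PARTITION OF readings
    FOR VALUES FROM ('2010-01-01') TO ('2011-01-01');
-- index (and optionally CLUSTER) each partition on the columns used for filtering
CREATE INDEX ON readings_2010 (cell_id, reading_on);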