How to calculate the size of tables saved on disk? (PostgreSQL)

How to calculate the size of tables saved on disk?
Based on my internet searching, the size of a table can be calculated with this formula:
8KB × ceil(number of records / floor(floor(8KB × fillfactor - 24) / (28 + data length of 1 record)))
Example:
Column   | Type
---------+---------------
aid      | integer
bid      | integer
abalance | integer
filler   | character(84)
data length of 1 record = aid (4 bytes) + bid (4 bytes) + abalance (4 bytes) + filler (84 bytes + 1 length byte) = 97 bytes
The data length of a record must be rounded up to a multiple of 8 bytes.
=> The data length of 1 record is 104 bytes.
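(To finish the worked example, assuming the default fillfactor of 100, i.e. 1.0: records per page = floor((8192 × 1.0 − 24) / (28 + 104)) = floor(8168 / 132) = 61, so for example 1,000,000 such records would need ceil(1,000,000 / 61) = 16,394 pages × 8 KB ≈ 128 MB.)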
Therefore, I think that 1 character is stored in 1 byte of memory.
However, the column "filler" can hold 84 "a" characters (single-byte) or 84 "あ" characters (multi-byte).
I don't understand how a multi-byte character can fit into a single byte.
Can you explain this to me?

It's much simpler: use pg_relation_size to calculate the size of one relation alone (without the associated TOAST tables and indexes), or pg_total_relation_size to include all associated objects.
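A minimal sketch, assuming the table above is pgbench's pgbench_accounts (its columns match pgbench's accounts table); pg_size_pretty just renders the byte count in human-readable form:

-- heap only: excludes TOAST tables and indexes
SELECT pg_size_pretty(pg_relation_size('pgbench_accounts'));

-- heap plus TOAST tables and all indexes
SELECT pg_size_pretty(pg_total_relation_size('pgbench_accounts'));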

Related

How to determine how much space 1 row will take in Postgres db?

I'm very new to Postgres so my math could be off here...
This is my table:
CREATE TABLE audit (
    id BIGSERIAL PRIMARY KEY,
    content_id VARCHAR (50) NULL,
    type VARCHAR (100) NOT NULL,
    size bigint NOT NULL,
    timestamp1 timestamp NOT NULL DEFAULT NOW(),
    timestamp2 timestamp NOT NULL DEFAULT NOW());
I want to make some estimations on how much space 1 row would occupy. So I did something like this:
1 row = id + content_id + type + size + timestamp1 + timestamp2
      = 8 + 50 + 100 + 8 + 8 + 8 bytes
      = 182 bytes
I also created this same table in my local Postgres, but the numbers are not matching:
INSERT INTO public.audit(
content_id, type, size)
VALUES ('aaa', 'bbb', 100);
SELECT pg_size_pretty( pg_total_relation_size('audit') ); -- returns 24 kb
INSERT INTO public.audit(
content_id, type, size)
VALUES ('aaaaaaaaaaaaa', 'bbbbbbbbbbbbbb', 100000000000);
SELECT pg_size_pretty( pg_total_relation_size('audit') ); -- still returns 24 kb
Which brings me to think that Postgres reserves 24 kB of space to start with, and as I put in more data it will get incremented by 132 bytes once I go beyond 24 kB? But something inside me says that can't be right.
I want to see how much space 1 row would occupy in a Postgres db so I can analyze how much data I can potentially store in it.
Edit
After reading more I've come up with this, is it correct?
1 row =
23 (heaptupleheader)
+ 1 (padding)
+ 8 (id)
+ 50 (content_id)
+ 6 (padding)
+ 100 (type)
+ 4 (padding)
+ 8 (size)
+ 8 (timestamp)
+ 8 (timestamp)
= 216 bytes
That "something inside me says that can't be right" is wrong. Actually trying id determine the size of each row is impractical. You can calculate the average row, and given a large number of rows the better that average get. Part of that reason is variable length columns. Your definition varchar(50) does not required bytes of storage unless unless it contains 50 bytes, if it has 20 then it only takes up 20 bytes (plus overhead), even then it's not exact as the padding may change. The definition only specifies the Maximum not the actual, storage is on actual.
As far as your 24 kB goes, that doesn't seem out of line at all. Keep in mind that physical I/O is the slowest individual operation possible, and trying to do I/O on individual rows would bring your system to a screeching halt. Postgres therefore only reads (and allocates space in) full blocks, and/or multiple blocks at a time, typically with a block size of 8 kB (8192 bytes). This is the trade-off: I/O performance vs. space allocation. It appears your system did a multi-block allocation of 3 blocks (??). If anything is surprising, it is that it is that small.
In short, trying to get the size of a single row is not practical; instead, take several hundred representative rows and calculate the average.
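For example, a rough sketch of that averaging against the audit table above (assuming it already holds a representative sample of rows):

-- average on-disk bytes per row, including page-level overhead
SELECT pg_relation_size('audit') / NULLIF(count(*), 0) AS avg_bytes_per_row
FROM audit;

-- individual tuple sizes, excluding page overhead, for comparison
SELECT pg_column_size(t.*) FROM audit AS t LIMIT 10;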
BTW you can change the length just by rearranging your columns (a DDL sketch follows the calculation):
1 row =
23 (heaptupleheader)
+ 1 (padding)
+ 8 (id)
+ 8 (size)
+ 8 (timestamp)
+ 8 (timestamp)
+ 50 (content_id)
+ 2 (padding) (if content_id contains all 50 chars)
+ 100 (type) (if type contains all 100 chars)
= 208 bytes
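For illustration, the reordered definition that layout corresponds to (a sketch: same columns as above, fixed-width ones first, variable-length ones last):

CREATE TABLE audit (
    id BIGSERIAL PRIMARY KEY,
    size bigint NOT NULL,
    timestamp1 timestamp NOT NULL DEFAULT NOW(),
    timestamp2 timestamp NOT NULL DEFAULT NOW(),
    content_id VARCHAR (50) NULL,
    type VARCHAR (100) NOT NULL
);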

Why does selecting a real type print out only 6 digits max in Postgres?

I have a Postgres table which has a column of real type.
When I use a query to select the values from this column (e.g. select amount from my_table), it rounds the values: for numbers larger than 100,000 it completely ignores the decimal part; for numbers larger than 1,000,000 it displays them using scientific notation; and for numbers smaller than 100,000 it strips the decimal part so that only 6 significant digits are printed overall (e.g. 1,000.12345 becomes 1,000.12).
It is only when I cast the value to double precision (using CAST(amount as double precision)) that it starts to behave as I'd expect and prints out all the stored decimal digits.
Does anyone have an idea why is Postgres behaving in this manner?
The answer is likely in PostgreSQL source code postgres/src/fe_utils/print.c.
Comments for format_numeric_locale say:
/*
* Format a numeric value per current LC_NUMERIC locale setting
*
* Returns the appropriately formatted string in a new allocated block,
* caller must free.
*
* setDecimalLocale() must have been called earlier.
*/
See also comments in setDecimalLocale.
It's also possible that some reasons can be found in SQL standards.
OK but with PG 12.2 there are some formatting differences between real and double precision:
# select x, x::double precision from t order by 1;
       x       |      x
---------------+-------------
        123456 |      123456
        999999 |      999999
         1e+06 |     1000000
  1.000001e+06 |     1000001
  1.234567e+06 |     1234567
 1.2345671e+06 | 1234567.125
  1.234568e+06 |     1234568
  9.999999e+06 |     9999999
  9.999999e+06 |     9999999
         1e+08 |   100000000
 1.2345679e+09 |  1234567936
 1.2345679e+09 |  1234567936
(12 rows)
And it seems that you can get about 7 significant decimal digits of precision.
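A minimal reproduction of the cast workaround described in the question (the exact digits shown depend on the server version and its float output settings):

CREATE TABLE my_table (amount real);
INSERT INTO my_table VALUES (1000.12345);

-- default real output shows only ~6 significant digits (pre-v12 formatting)
SELECT amount FROM my_table;

-- the cast prints the stored single-precision value with double precision formatting
SELECT CAST(amount AS double precision) FROM my_table;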

PostgreSQL - converting text to tsvector

Sorry for the basic question.
I have a table with the following columns.
Column | Type | Modifiers
--------+---------+-----------
id | integer |
doc_id | bigint |
text | text |
I am trying to do text matching on the 'text' column (the 3rd column).
I receive an error message when I try to match against the text column, saying that the string is too long for tsvector.
I only want observations which contain the words "other events"
SELECT * FROM eightks
WHERE to_tsvector(text) @@ to_tsquery('other_events')
I know that there are limitations to the length of a tsvector.
Error Message
ERROR: string is too long for tsvector (2368732 bytes, max 1048575 bytes)
How do I convert the text column into a tsvector, and will this resolve my size limit problem? Alternatively, how do I exclude observations over the maximum size?
Postgres version 9.3.5.0
Here is the reference to the limit.
Thanks
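One possible sketch for the second option, excluding oversized rows (a heuristic only: the 1048575-byte limit from the error applies to the generated tsvector, not to the source text, and Postgres does not guarantee the order in which AND conditions are evaluated):

SELECT *
FROM eightks
WHERE octet_length(text) < 1048575
  AND to_tsvector(text) @@ to_tsquery('other & events');

Note that to_tsquery('other & events') matches rows containing both words; a true phrase search ("other" immediately followed by "events") is not available in tsquery until PostgreSQL 9.6.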

PostgreSQL - random primary key

I need a primary key for a PostgreSQL table. The ID should be a number of about 20 digits.
I am a beginner at databases and have not worked with PostgreSQL before. I found some examples for a random ID, but those examples were with characters, and I need only an integer.
Can anyone help me resolve this problem?
I'm guessing you actually mean random 20-digit numbers, because a random number between 1 and 20 would rapidly repeat and cause collisions.
What you need probably isn't actually a random number, it's a number that appears random while actually being a non-repeating pseudo-random sequence. Otherwise your inserts will randomly fail when there's a collision.
When I wanted to do something like this a while ago I asked the pgsql-general list and got a very useful piece of advice: use a Feistel cipher over a normal sequence. See this useful wiki example. Credit to Daniel Vérité for the implementation.
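For reference, a sketch along the lines of the wiki implementation (a three-round Feistel network over the two 16-bit halves of a 32-bit input; consult the wiki page for the canonical version):

CREATE OR REPLACE FUNCTION pseudo_encrypt(value int) RETURNS int AS $$
DECLARE
  l1 int; l2 int;
  r1 int; r2 int;
  i int := 0;
BEGIN
  l1 := (value >> 16) & 65535;   -- upper 16 bits
  r1 := value & 65535;           -- lower 16 bits
  WHILE i < 3 LOOP
    l2 := r1;
    -- round function: a cheap pseudo-random mix of the right half
    r2 := l1 # ((((1366 * r1 + 150889) % 714025) / 714025.0) * 32767)::int;
    l1 := l2;
    r1 := r2;
    i := i + 1;
  END LOOP;
  -- note: can come out negative for some inputs; the wiki also has a bigint variant
  RETURN ((r1 << 16) + l1);
END;
$$ LANGUAGE plpgsql STRICT IMMUTABLE;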
Example:
postgres=# SELECT n, pseudo_encrypt(n) FROM generate_series(1,20) n;
n | pseudo_encrypt
----+----------------
1 | 1241588087
2 | 1500453386
3 | 1755259484
4 | 2014125264
5 | 124940686
6 | 379599332
7 | 638874329
8 | 898116564
9 | 1156015917
10 | 1410740028
11 | 1669489846
12 | 1929076480
13 | 36388047
14 | 295531848
15 | 554577288
16 | 809465203
17 | 1066218948
18 | 1326999099
19 | 1579890169
20 | 1840408665
(20 rows)
These aren't 20 digits, but you can pad them by multiplying them and truncating the result, or you can modify the Feistel cipher function to produce larger values.
To use this for key generation, just write:
CREATE SEQUENCE mytable_id_seq;
CREATE TABLE mytable (
    id bigint primary key default pseudo_encrypt(nextval('mytable_id_seq')::int),
    ....
);
ALTER SEQUENCE mytable_id_seq OWNED BY mytable.id;

Table size with page layout

I'm using PostgreSQL 9.2 on Oracle Linux Server release 6.3.
According to the storage layout documentation, a page layout holds:
PageHeaderData (24 bytes)
n item pointers to the items (index or table entries), a.k.a. ItemIdData (4 bytes each)
free space
n items
special space
I tested it to build a formula for estimating the anticipated table size (TOAST is ignored here).
postgres=# \d t1;
                      Table "public.t1"
  Column   |          Type          |          Modifiers
-----------+------------------------+------------------------------
 code      | character varying(8)   | not null
 name      | character varying(100) | not null
 act_yn    | character(1)           | not null default 'N'::bpchar
 desc      | character varying(100) | not null
 org_code1 | character varying(3)   |
 org_cole2 | character varying(10)  |
postgres=# insert into t1 values(
'11111111',  -- 8 characters
'1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111',  -- 100 characters
'Y',
'1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111',  -- 100 characters
'111',
'1111111111');
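To reproduce the check below, note that pgstattuple ships as a contrib extension and has to be enabled once per database (assuming the contrib package is installed):

CREATE EXTENSION pgstattuple;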
postgres=# select * from pgstattuple('t1');
table_len | tuple_count | tuple_len | tuple_percent | dead_tuple_count | dead_tuple_len | dead_tuple_percent | free_space | free_percent
-----------+-------------+-----------+---------------+------------------+----------------+--------------------+------------+--------------
8192 | 1 | 252 | 3.08 | 1 | 252 | 3.08 | 7644 | 93.31
(1 row)
Why is tuple_len 252 instead of 249 ("222 bytes for all columns at maximum length" plus "27 bytes of tuple header, followed by an optional null bitmap, an optional object ID field, and the user data")?
Where do the 3 extra bytes come from?
Is there something wrong with my formula?
Your calculation is off at several points.
Storage size of varchar, text (and character!) is, quoting the manual:
The storage requirement for a short string (up to 126 bytes) is 1 byte plus the actual string, which includes the space padding in the case of character. Longer strings have 4 bytes of overhead instead of 1. Long strings are compressed by the system automatically, so the physical requirement on disk might be less.
Emphasis mine, to address the question in the comments.
The HeapTupleHeader occupies 23 bytes, but each tuple ("item": a row or index entry) also has an item identifier at the start of the data page pointing to it, which totals the mentioned 27 bytes. The distinction is relevant because actual user data begins at a multiple of MAXALIGN from the start of each item, and the item identifier counts against neither this offset nor the reported "tuple size".
1 byte of padding due to data alignment (multiple of 8), which is used for the NULL bitmap in this case.
No padding for type varchar (but the additional byte mentioned above)
So, the actual calculation (with all columns filled to the maximum) is:
23 -- heap tuple header
+ 1 -- NULL bitmap (or padding if row has NO null values)
+ 9 -- columns ...
+ 101
+ 2
+ 101
+ 4
+ 11
-------------
252 bytes
+ 4 -- item identifier at page start
Related:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
Calculating and saving space in PostgreSQL