Making sense of Postgres row sizes

Making sense of Postgres row sizes - postgresql

I got a large (>100M rows) Postgres table with structure {integer, integer, integer, timestamp without time zone}. I expected the size of a row to be 3*integer + 1*timestamp = 3*4 + 1*8 = 20 bytes.
In reality the row size is pg_relation_size(tbl) / count(*) = 52 bytes. Why?
(No deletes are done against the table: pg_relation_size(tbl, 'fsm') ~= 0)

Calculation of row size is much more complex than that.
Storage is typically partitioned in 8 kB data pages. There is a small fixed overhead per page, possible remainders not big enough to fit another tuple, and more importantly dead rows or a percentage initially reserved with the FILLFACTOR setting.
And there is even more overhead per row (tuple): an item identifier of 4 bytes at the start of the page, the HeapTupleHeader of 23 bytes and alignment padding. The start of the tuple header as well as the start of tuple data are aligned at a multiple of MAXALIGN, which is 8 bytes on a typical 64-bit machine. Some data types require alignment to the next multiple of 2, 4 or 8 bytes.
Quoting the manual on the system table pg_tpye:
typalign is the alignment required when storing a value of this type.
It applies to storage on disk as well as most representations of the
value inside PostgreSQL. When multiple values are stored
consecutively, such as in the representation of a complete row on
disk, padding is inserted before a datum of this type so that it
begins on the specified boundary. The alignment reference is the
beginning of the first datum in the sequence.
Possible values are:
c = char alignment, i.e., no alignment needed.
s = short alignment (2 bytes on most machines).
i = int alignment (4 bytes on most machines).
d = double alignment (8 bytes on many machines, but by no means all).
Read about the basics in the manual here.
Your example
This results in 4 bytes of padding after your 3 integer columns, because the timestamp column requires double alignment and needs to start at the next multiple of 8 bytes.
So, one row occupies:
23 -- heaptupleheader
+ 1 -- padding or NULL bitmap
+ 12 -- 3 * integer (no alignment padding here)
+ 4 -- padding after 3rd integer
+ 8 -- timestamp
+ 0 -- no padding since tuple ends at multiple of MAXALIGN
Plus item identifier per tuple in the page header (as pointed out by #A.H. in the comment):
+ 4 -- item identifier in page header
------
= 52 bytes
So we arrive at the observed 52 bytes.
The calculation pg_relation_size(tbl) / count(*) is a pessimistic estimation. pg_relation_size(tbl) includes bloat (dead rows) and space reserved by fillfactor, as well as overhead per data page and per table. (And we didn't even mention compression for long varlena data in TOAST tables, since it doesn't apply here.)
You can install the additional module pgstattuple and call SELECT * FROM pgstattuple('tbl_name'); for more information on table and tuple size.
Related:
Table size with page layout
Calculating and saving space in PostgreSQL

Each row has metadata associated with it. The correct formula is (assuming naïve alignment):
3 * 4 + 1 * 8 == your data
24 bytes == row overhead
total size per row: 23 + 20
Or roughly 53 bytes. I actually wrote postgresql-varint specifically to help with this problem with this exact use case. You may want to look at a similar post for additional details re: tuple overhead.

Related

What is avg_item_size in bt_page_stats?

i am trying to find out what exactly is avg_item_size in bt_page_stats?
I am using postgresql 13.4 with extension pageinspect.
In my case, if index is created on text it is about 20-200 and if it is on integer it is about 700, I am curious why is avg_item_size bigger with integer than text.
Update:
Here are bt_page_stats of my index created on single integer column. bt_page_stats
Here is same on bt_page_items.
In picture 2 I can see that itemlen prop is 24 and than I have 9 items with itemlen 808. So avg_item_size 729 is from there and it is averagy of all items.
Now I see, that if I have index created on string, there are no tids but on integer there is a lot of data in tids.
After future exploration I found source code of PostgreSQL> https://docs.huihoo.com/doxygen/postgresql/dir_57dbf4d3eda9e499038b5c7aaccc39c5.html directly to function pageinspect.
I was not sure about tids, but from https://www.postgresql.org/docs/8.3/datatype-oid.html I found the answer:
tid, or tuple identifier (row identifier). This is the data type of the system column ctid. A tuple ID is a pair (block number, tuple index within block) that identifies the physical location of the row within its table.
But I still do not understand why tids are when I create index on integer and not on text.

avg_item_size is the average size of an index entry.
For an index on a single integer column, that should be 16 (if you have duplicates, and you are using v13 or above, it can be less because of index de-duplication).
An index entry in a leaf page will consist of the t_tid (address of the row), which is 6 bytes, a 2-byte t_info and the integer (4 bytes, but really 8 bytes because of alignment). You can use bt_page_items to verify that.

kdb q - efficiently count tables in flatfiles

I have a lot of tables stored in flat files (in a directory called basepath) and I want to check their number of rows. The best I can so right now is:
c:([] filename:system "ls ",basepath;
tablesize:count each get each hsym `$basepath,/:system "ls ",basepath)
which loads each table entirely into memory and then performs the count (that's quite slow). Is saving as splayed tables the only way to make this faster (because I would only load 1 column and count that) or is there a trick in q that I can use?
Thanks for the help

If you have basepath defined as a string of the path to directory where all your flat tables are stored then you can create a dictionary of the row counts as follows:
q)cnt:{count get hsym x}
q)filename:key hsym `$basepath
q)filename!cnt each filename
t| 2
g| 3
This is where I have flat tables t and g saved in my basepath directory. This stops you from having to use system commands which are often less effiecient.
The function cnt takes the path of each flat table (as a symbol) and returns the number of rows without saving them into memory.
The best solution if you have control of the process of saving such files down is to add an extra step of saving the meta information of the row count somewhere seperate at the same time of saving the raw data. This would allow you to quickly access the table size from this file instead of reading the full tbale in each time.
However, note that to avoid pulling them into memory at all you would have to instead use read1 and look at the headers on the binary data. As you said it would be better to save as a splayed table and read in one column.
UPDATE: I would not recommend doing this and strongly suggest doing the above but for curiosity after looking into using read1 here's an example what what a hacky solution might look like:
f:{
b:read1(y;0;x);
if[not 0x62630b~b[2 4 5];'`$"not a table"];
cc:first first((),"i";(),4)1:b 7+til 4;
if[null ce:first where cc=sums 0x0=11 _ b;:.z.s[x*2;y]];
c:`$"\000" vs "c"$b[11+til ce];
n:first first((),"i";(),4)1:b[(20+ce)+til 4];
:`columns`rows!(c;n);
}[2000]
The q binary file format isn’t documented anywhere, the only way to figure it out is to save different things and see how the bytes change. It’s also subject to changes between versions - the above is written for 3.5 and is probably valid for 3.0-3.5 only, not the latest 3.6 release or anything 2.X.
The given code works in the following way:
reads a chunk from the front of the file
validates that it looks like a flat unkeyed table (flip[98] of a dict[99] with symbol[11] keys)
reads the count of symbols in the list of columns as a little endian 4 byte int
scans through the null terminated strings for that many zero bytes
if the columns are so numerous or verbose that we don’t have them
all in this chunk it will double the size of the chunk and try again
turn the strings into symbols
using the offset we get from the end of the column list, skip a bit
more of the header for the mixed list of columns
then read the count from the header of the first column
Hope this answers your question!

From experimenting with the binary files, it seems that the table count is saved as part of the binary file when you save down a flat file, taking up 4 bytes after the initial object type and column headings which will vary from table to table.
`:test set ([]a:1 2 3;b:4 5 6;c:7 8 9;aa:10 11 12;bb:13 14 15)
q)read1 `:test
0xff016200630b000500000061006200630061610062620000000500000009000300000
0 7 11 31
bytes | example | meaning
---------------------------------------------------------------------------------------
0 - 5 | 0xff016200630b0 | object is a flat table
7 - 11 | 0x05000000 | number of columns (5)
12- 22 | 0x6100620063006161006262 | one byte for the ascii values of column "a" and "b" in hex followed by the one byte separator
23 - 30 | 0x0000050000000900 | 8 bytes that can be skipped
31 - 34 | 0x0300000 | 4 bytes for row count of first column (3)
This should help you understand the function that Fiona posted.
The binary is saved down little-endian meaning the most-significant byte is the right-hand most digit - doing this in decimal for the number 100 would give 001, with the 100's (most significant) on the right and then 10s and finally 1s on the left. In the binary file, each group of 2 digits is a byte.
You can use 1: to read in the contents of a binary file, with additional arguments in the list specifying the offset - where to start reading from, and how many bytes to read. In our case we want to start at byte 31 and read in 4 bytes, specifying the output should be an integer and to cut the input into separate 4 byte chunks.
q)first first (enlist "i";enlist 4)1:(`:test;31;4)
3i
Converting the little-endian bytes into a long gives us the row count. Since this only has to read in 4 bytes instead of the whole file it is a lot quicker.
For a table with 10000 rows and 2 columns there is not much difference:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10000;31;4)
0
q)\t count get `:test10000
0
For a table with 100m rows and 2 columns:
q)\t 0x0 sv reverse first (enlist "x";enlist 1)1:(`:test10m;31;4)
0
q)\t count get `:test10m
2023
If you have a splayed table instead you can read in the number of elements in one of the columns from bytes 9-13 like so, assuming the column is a simple list:
q)first first (enlist "i";enlist 4)1:(`:a;8;4)
3i
You can read more about reading in from binary files here https://code.kx.com/q/ref/filenumbers/#1-binary-files

You can make what you currently have more efficient by using the following
counttables:{count each get each hsym `$basepath}
This will improve the speed of the count by not including the extra read in of the data as well as the join which you are currently doing. You are correct though that if the tables where saved splayed you would only have to read in the one column making it much more efficient.

If your tables are stored uncompressed there's probably something quite hacky you could do with a read1 on the headers within the file until you find the first column header.
But v hacky :-(
Are you responsible for saving these down? Can you keep a running state as you do?

Cannot allocate memory for a column of compound floats on a partitioned table

I have a partitioned table in my hdb that includes a column containing large lists of floats (at most 400 floats per element). eg each element looks like
(100.0 1.0 ...)
When trying to select on this column from days where there are particularly high numbers of rows I get an error saying
'./2015.02.07/table/column# Cannot allocate memory
The same error arises from a query like:
select column[;0] from table where date=2015.02.07
even though on days with fewer rows this query returns the first value of each element in the column.
Is there a way to stream this column in a select to decrease the memory requirements of holding the whole column in memory for a large day?
EDIT
.Q.ind on large days fails with the same error.
ie given I can work with 2015.02.01 but not 2015.02.02:
.Q.ind[select from table where date=2015.02.01;enlist 1]
is fine but
.Q.ind[select from table where date=2015.02.02;enlist 1]
fails with
{0!$[#.Q.pm;p3;(?).]#[x;0;p1[;y;z]]}
'./2015.02.10/table/column2#: Cannot allocate memory
#
.[?]
(+`time`sym`column1`column2!`:./2015.02.02/table;();0b;())
I should note I am using the free 32-bit version

I think this is all just a combination of the free-32bit memory limitation, the fact that your row counts are possibly large and the fact that (unavoidably) something must be pulled entirely into memory when retrieving data from a column, whether it is the column itself that gets entirely pulled in (in the non-nested case) or if its the nested-index column that gets entirely pulled in.
Another thing to consider is that kdb uses powers-of-two (buddy) memory allocation. Even if todays table only contains one more row than yesterdays, the memory requirements per column could double. Take a simple example:
In the free 32bit version (windows) you can create this many floats and it only uses ~1.07gb of memory
q)\ts 134217726?1.0
3093 1073741952
However, try to generate one extra float and you hit a memory limit
q)\ts 134217727?1.0
wsfull
So even a small amount of rows in the difference between one day and the next can be very significant if you're near the boundary of allocatable powers of two.
--DISCLAIMER-- the following is hacky and is only intended for debugging!
You can actually manually try to access the data from the nested list, though you may still have memory issues here anyway.
Create a nested table and splay it
q)tab:([] col1:(101 102 103f;104 105f;106 107 108 109 110f;111 112f))
q)tab
col1
--------------------
101 102 103f
104 105f
106 107 108 109 110f
111 112f
q)
q)`:test/ set tab
`:test/
You can try to read in the indices from the nested-index file
q)2_first (enlist "j";enlist 8)1:`:test/col1
3 5 10 12
So the indices for splitting the full list of floats (the col1# file) is index 3, index 5, 10 etc etc
Say I want the first 3 rows
q)myrows:3#2_first (enlist "j";enlist 8)1:`:test/col1
q)myrows
3 5 10
then I know that I need the first 10 floats from the col1# file and need to split them at index 3 and 5. Then I can read the col1# file partially and split it correctly
q)(0,-1_myrows) cut raze (enlist "f";enlist 8)1:(`$":test/col1#";0;8*last myrows)
101 102 103f
104 105f
106 107 108 109 110f
But this is precisely what KDB does under the covers anyway so I suspect that you'll still have trouble even reading in the nested-index file in the first place.
Check this debug/hack and see if you can partially read that way. But obviously it's not a long-term solution!

Nested columns make querying in the usual way difficult, as the # file also needs to be loaded into memory (even with a [;0])
Your best bet is to select map a date partition in, and then select within that chunk by chunk, e.g. a million rows at a time (or whatever is sensible given the size of nested floats).
Perhaps also consider 32bit floats, if some decimal accuracy can be sacrificed.
EDIT
So after comments I guess the best way is to go each partition a number of lines at a time with .Q.ind

Just to give my 2 cents on this, I had a similar error but with a 64-bit instance.
I suspected that the memory needed to be de-fragmented as it was running for almost a year.
Bouncing the instance solved the issue, and released a lot of virtual memory

postgres text storage inline or in "background table"?

In PostgreSQL, how can I tell whether a text column is stored inline or stored in a "background table"?
Documentation for text column types says that
Very long values are also stored in background tables so that they do not interfere with rapid access to shorter column values.
Is there a fixed length at which a value is determined to be "very long"? If not, are there other ways of telling how my columns are laid out on disk? I have a table with several columns that are text (or varchar(n)) and want to understand how they are stored under the hood. Is there more documentation on these "background tables" somewhere?

Any varlena data type (all types with variable length or types longer than 4 bytes (32 bits) or 8 bytes (64 bits)) can be TOASTed - TOAST is a process that tries to reduce long rows (records) to 8KB page size.
Row size is checked before physically storing to the relation. When the size exceeds 2KB, most larger fields are selected, compressed, sliced to 2KB chunks and moved to a secondary table file with the suffix _toast. A pointer to the toast file replaces the data in the main storage. This process is repeated while the row is bigger than 2KB.
Follow the links provided by a_horse_with_no_name and IMSoP for more detailed documentation.

If your table is called t1, then enter \d+ t1 at your psql prompt, it will show a column storage mode.

Does not using NULL in PostgreSQL still use a NULL bitmap in the header?

Apparently PostgreSQL stores a couple of values in the header of each database row.
If I don't use NULL values in that table - is the null bitmap still there?
Does defining the columns with NOT NULL make any difference?

It's actually more complex than that.
The null bitmap needs one bit per column in the row, rounded up to full bytes. It is only there if the actual row includes at least one NULL value and is fully allocated in that case. NOT NULL constraints do not directly affect that. (Of course, if all fields of your table are NOT NULL, there can never be a null bitmap.)
The "heap tuple header" (per row) is 23 bytes long. Actual data starts at a multiple of MAXALIGN (Maximum data alignment) after that, which is typically 8 bytes on 64-bit OS (4 bytes on 32-bit OS). Run the following command from your PostgreSQL binary dir as root to get a definitive answer:
./pg_controldata /path/to/my/dbcluster
On a typical Debian-based installation of Postgres 12 that would be:
sudo /usr/lib/postgresql/12/bin/pg_controldata /var/lib/postgresql/12/main
Either way, there is one free byte between the header and the aligned start of the data, which the null bitmap can utilize. As long as your table has 8 columns or less, NULL storage is effectively absolutely free (as far as disk space is concerned).
After that, another MAXALIGN (typically 8 bytes) is allocated for the null bitmap to cover another (typically) 64 fields. Etc.
This is valid for at least versions 8.4 - 12 and most likely won't change.

The null bitmap is only present if the HEAP_HASNULL bit is set in t_infomask. If it is present it begins just after the fixed header and occupies enough bytes to have one bit per data column (that is, t_natts bits altogether). In this list of bits, a 1 bit indicates not-null, a 0 bit is a null. When the bitmap is not present, all columns are assumed not-null.
http://www.postgresql.org/docs/9.0/static/storage-page-layout.html#HEAPTUPLEHEADERDATA-TABLE
so for every 8 columns you use one byte of extra storage. Then for every about million rows that would take up one megabyte of storage. Does not really seem that important. I would define the tables how they needed to be defined and not worry about null headers.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse