Will updating a null value split the row in the file? - postgresql

I have a colleague who tells me that the reason we add default values instead of null values to our tables is that PostgreSQL allocates a certain number of bytes in a file when a new row is stored. If such a column gets updated later on, the row might end up being split into two rows in the file, and multiple I/O operations will have to occur when reading and writing.
I'm not a PostgreSQL expert at all, and I have a hard time finding any documentation suggesting this.
Can someone clarify this for me?
Is this a good reason for not having null values in a column and using some default instead? Will there be any huge performance issues in such cases?

I'm not sure I'd say the documentation is hard to find:
https://www.postgresql.org/docs/10/storage-file-layout.html
https://www.postgresql.org/docs/current/storage-page-layout.html
It's fair to say there is a lot to absorb though.
So, the reason you SHOULD have defaults rather than NULLs is that you don't want an "unknown" in your column. Start with the requirements before worrying about efficiency tweaks.
Whether a particular value is null is stored in a bitmap. This bitmap is optional, so if there are no nulls in a row then the bitmap is not created. That suggests nulls would make a row bigger. But wait: if a bit is set to show null, then you don't need the overhead of the value structure, and (IIRC - you'll need to check the docs) that can end up saving you space. There is a good chance that general per-row overheads and type alignment issues are far more important to you, though.
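If you want to see the effect for your own types, pg_column_size() lets you compare a row value containing a NULL with the same row holding a non-null default (a minimal sketch; the exact numbers depend on your column types and alignment):
-- pg_column_size() reports the on-disk size of a value; compare a row with a NULL
-- against the same row with an empty-string default
SELECT pg_column_size(ROW(1::int, NULL::text)) AS with_null,
       pg_column_size(ROW(1::int, ''::text))   AS with_empty_default;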
However - all of this is ignoring the elephant* in the room, which is that if you update a row then PostgreSQL marks the current version of the row as expired and creates a whole new row version. So the description of how updates work in the first paragraph you wrote is simply confused.
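You can watch this happen through the hidden ctid column, which shows the physical location of the current row version (a sketch; the table and column names are made up):
SELECT ctid, id FROM demo WHERE id = 1;        -- e.g. (0,1)
UPDATE demo SET val = 'changed' WHERE id = 1;  -- writes a brand-new row version
SELECT ctid, id FROM demo WHERE id = 1;        -- now a different location, e.g. (0,2)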
So - don't worry about the efficiency of nulls in 99.9% of cases. Worry about using them properly and about the general structure of your database, its indexes and queries.
* no I'm not apologising for that pun.

Related

PostgreSQL varchar length performance impact

I have a table in PostgreSQL which is under heavy read load. It is practically the core table of the application. One column is used as a discriminator - a column used by the application to determine the type of entity (class) that a given row represents. It has to be exactly one varchar column. Currently I store the full class name in it, like "bank_statement_transaction".
When the application selects all bank statement transactions, the query is built like ... WHERE Discriminator = 'bank_statement_transaction'. This brings more readability and clarity to the data, structure and code.
The table currently contains 3M rows and counting, with approximately 100k new rows monthly. The discriminator was indexed during some performance tuning. I don't have any performance issues right now.
I am working on a new feature that requires a little refactoring, and I had the idea of changing the full class name (bank_statement_transaction) to a short unique code (BST).
I replicated dbo and changed the full class names to codes. With 3M rows, the performance gain is barely measurable - the same or 1-2 milliseconds faster.
Can anyone share experience with the impact of VARCHAR length on index size and performance, ideally on a bigger data set? Is this change worth it?
If you index strings, the index will become larger if the strings are long. The fan-out will be smaller, so the index will become deeper.
With an index scan that searches for a few rows, this won't be noticeable: reading a few more blocks and running comparisons on longer strings may be lost in the noise for any but the simplest queries. Still, you'll be faster with smaller strings.
Maybe the most noticeable effect will be that a smaller index needs less RAM for caching, so the number of disk reads should go down.
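If you want to measure rather than guess, you can build both variants and compare the index sizes directly (a sketch; the table and index names here are hypothetical):
-- compare the on-disk size of an index on the long class names vs. the short codes
CREATE INDEX idx_discriminator_name ON core_table (discriminator);
CREATE INDEX idx_discriminator_code ON core_table (discriminator_code);
SELECT pg_size_pretty(pg_relation_size('idx_discriminator_name')) AS long_names,
       pg_size_pretty(pg_relation_size('idx_discriminator_code')) AS short_codes;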

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the following fields:
id_variable [int] (PK),
id_departamento [int] (PK),
fecha [date] (PK),
valor [number].
I want to get the minimum, maximum and average of valor, grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query? Should I create a new index only on id_variable, or is the default index enough for this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there were, say, 50 columns, or TEXT columns that make the table file huge. In that case, reading the whole table just to average a few ints would mean grinding through tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep the small column set in cache. But that is not the case here: you only have small columns, so this reason is out.
...and of course slightly slower UPDATEs, since the index needs to be updated. Also, the index needs to be cached, it's going to use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash aggregate, which is super fast anyway, it's not so useful.
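For reference, a covering index for this particular query could look something like this (a sketch; the index name is made up, and on PostgreSQL 11+ you could also write it with INCLUDE (valor)):
-- sorted by id_variable, so the aggregate can run over an index-only scan without a separate sort
-- (index-only scans also need the visibility map to be reasonably up to date)
CREATE INDEX medicion_var_valor_idx ON medicion (id_variable, valor);
-- check what the planner actually chooses
EXPLAIN (ANALYZE) SELECT AVG(valor), MIN(valor), MAX(valor) FROM medicion GROUP BY id_variable;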
Now, if you have relatively few distinct values of id_variable... say, few enough to fit into a hash aggregate, which can be a sizable amount depending on your work_mem... then it'll be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
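A minimal sketch of the materialized-view variant (the view name is invented; refresh it however often staleness is acceptable):
CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable, MIN(valor) AS min_valor, MAX(valor) AS max_valor, AVG(valor) AS avg_valor
FROM medicion
GROUP BY id_variable;
-- re-run after new data arrives, or on a schedule
REFRESH MATERIALIZED VIEW medicion_stats;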
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.

"Big" data in a JSONB column

I have a table with a metadata column (JSONB). Sometimes I run queries on this column. Example:
select * from "a" where metadata->'b'->'c' is not null
This column always holds just small JSON objects (<1KB). But for some records (less than 0.5%), it can be >500KB, because some sub-sub-properties contain a lot of data.
Today I only have ~1000 records and everything works fine. But I think I will have many more records soon, and I don't know whether having some big values (I'm not talking about "Big Data", of course!) will have a global impact on performance. Is 500KB "big" for Postgres, and is it "hard" to parse? Maybe my question is too vague; I can edit it if required. In other words:
Does having some (<0.5%) bigger entries in a JSONB column noticeably affect the global performance of JSON queries?
Side note: assuming the "big" data is in metadata->'c'->'d', I don't run any queries against this particular property. Queries are always done on the "small/common" properties, but the "big" properties still exist.
It is a theoretical question, so I hope a generic answer will satisfy.
If you need numbers, I recommend that you run some performance tests. It shouldn't be too hard to generate some large jsonb objects for testing.
As long as the data are jsonb and not json, operations like metadata->'b'->'c' will be fast. Where you could lose time is when a large jsonb value is loaded and uncompressed from the TOAST table (“detoasted”).
You can avoid that problem by creating an index on the expression. Then an index scan does not have to detoast the jsonb and hence will be fast, no matter how big the jsonb is.
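For the example query above, that could be an expression index along these lines (a sketch; the index name is invented, and it's worth checking with EXPLAIN that your real queries actually use it):
-- indexes only the extracted sub-value, so matching rows are found without detoasting the full document
CREATE INDEX a_metadata_b_c_idx ON "a" ((metadata -> 'b' -> 'c'));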
So I think that you will not encounter problems with performance as long as you can index for the queries.
You should pay attention to the total size of your database. If the expected size is very large, backups will become a challenge. It helps to design and plan how to discard old and unused data; that is an aspect commonly ignored during application design that tends to cause headaches years later.

ef-core - bool index vs historical table

I'm using aspnet-core and ef-core with SQL Server. I have an 'order' entity. As I'm expecting the orders table to be large, and the most frequent query would get only the active orders for a certain customer (active orders are just a tiny fraction of the whole table), I'd like to optimize the speed of the query, but I can't decide between these two approaches:
1) I don't know if this is possible as I haven't done it before, but I was thinking about creating a Boolean column named 'IsActive' and making it an index, so that querying only active orders would be faster.
2) When an order becomes inactive, move it to another table, i.e. HistoricalOrders, thus keeping the orders table small.
Which of the two would give better results? Or is neither of them a good solution, and could a third approach be suggested?
If you want to partition away cold data then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition, including the clustered index, which is quite awkward. The query optimizer requires that you add a dummy predicate where IsActive IN (0, 1) to make it still able to seek on such indexes. Of course, this will now also touch the cold data, so you probably need to know the concrete value of IsActive, or try the 1 value first and be sure that it matches 99% of the time.
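As a rough illustration of that point (a sketch for SQL Server; the table, column and index names are invented):
-- every index you want to hot/cold partition gets IsActive as its leading column
CREATE INDEX IX_Orders_IsActive_CustomerId ON Orders (IsActive, CustomerId);
-- a query that only knows the customer: the redundant IsActive IN (0, 1) predicate lets the
-- optimizer do one seek per flag value instead of scanning, though it still touches the cold rows
SELECT OrderId, OrderDate
FROM Orders
WHERE CustomerId = 42
  AND IsActive IN (0, 1);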
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you care about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note that many indexes are already naturally partitioned. Indexes on an identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those cases are already solved.
Some further discussion.
Before thinking about a new HistoricalOrders table, just create a column named IsActive and test it with your data. You don't need to make it an indexed column, because indexes eat up storage and slow down writes and updates, so we must be very careful when we create an index. When you query the data, do it as shown below. In the query below, the data selection (or filter) is done on the SQL Server side (IQueryable), so it is very fast.
Note: use AsNoTracking() too. It will boost performance as well.
var activeOrders = _context.Set<Orders>().Where(o => o.IsActive == true).AsNoTracking()
    .ToList();
Reference: AsNoTracking()

PostgreSQL - Clustering never completes - long key?

I am having problems clustering a table whose key consists of one char(23) field and two timestamp fields. The char(23) field contains alphanumeric values. The clustering operation never finishes; I have let it run for 24 hours and it still did not finish.
Has anyone run into this kind of problem before? Does my theory that the long key field is the reason make sense? We have dealt with much larger tables that do not have long keys, and we have always been able to perform DB operations on them without any problem. That makes me think it might have to do with the size of the key in this case.
Cluster rewrites the table so it must wait on locks. It is possible that it is never getting the lock it needs. Why are you setting varchar(64000)? Why not just unrestricted varchar? And how big is this index?
If size is a problem, it has to be based on the index size, not the key size. I don't know what the effect of TOASTed key attributes is on CLUSTER, because these are moved into extended storage. TOAST might complicate CLUSTER, and I have never heard of anyone clustering on a TOASTed attribute; it wouldn't make much sense to do so. TOASTing is necessary for any attribute more than 4k in size.
A better option is to create an index on the values excluding the possibly TOASTed one, and then cluster on that. That should give you something very close to what you'd get otherwise.
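If you go that route, the mechanics are straightforward (a sketch; the table, column and index names are hypothetical):
-- build an index on the columns that are not subject to TOAST, then cluster on it
CREATE INDEX big_table_ts_idx ON big_table (start_ts, end_ts);
CLUSTER big_table USING big_table_ts_idx;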