How is the performance of Postgres / TimescaleDB unique composite keys? - postgresql

To make it short:
I have a TimescaleDB table with multiple columns:
from
to
code
point
...
There are a few more columns.
I need to keep the rows unique, so that the combination of the above columns (from, to, code, point) is unique. Each individual column can contain repeated values, but the combination must be unique.
How would that affect performance? There would be relatively many insertions (a few hundred per minute).
EDIT:
The project is still relatively new, so we can make changes to the database. There are alternatives that I can consider. I also do not know whether there will only be a few hundred insertions per minute or whether the volume will grow over time. I know the question leaves room for debate and I am grateful for every answer, but I asked how composite keys affect performance so that I can make an informed decision.
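For reference, a minimal sketch of what such a constraint could look like in plain PostgreSQL. The table name and all column types are assumptions; only the four column names come from the question. A unique constraint is backed by a B-tree index on the listed columns, so every insert has to check and maintain that index. On a TimescaleDB hypertable, a unique index must also include the partitioning (time) column; the sketch assumes that would be "from".

```sql
-- Hypothetical table name and column types; only the column names are from the question.
CREATE TABLE measurements (
    "from"  timestamptz NOT NULL,  -- quoted because FROM is a reserved word
    "to"    timestamptz NOT NULL,  -- quoted because TO is a reserved word
    code    text        NOT NULL,
    point   integer     NOT NULL,
    -- The composite key: backed by one B-tree index over all four columns.
    CONSTRAINT measurements_from_to_code_point_key
        UNIQUE ("from", "to", code, point)
);

-- First insert of this combination succeeds.
INSERT INTO measurements ("from", "to", code, point)
VALUES ('2020-01-01 00:00+00', '2020-01-01 01:00+00', 'A', 1);

-- Same combination again: skipped instead of raising a unique violation.
INSERT INTO measurements ("from", "to", code, point)
VALUES ('2020-01-01 00:00+00', '2020-01-01 01:00+00', 'A', 1)
ON CONFLICT ("from", "to", code, point) DO NOTHING;
```

Timing a representative batch of inserts with and without the constraint would show the actual cost for a given workload.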

Related

Criteria and strategy for partitioning a large table in Postgres

We are looking into migrating our app to a multi-tenant database. Currently, the app runs with one database per tenant. There are currently around 400 tenants. When combined, the largest table would have around 1 billion rows and would grow as tenants are added. Size by tenant varies wildly, with one tenant alone having 180 million records in that table, some having less than a million. There are a few other tables in the hundred millions, most tables would have much less. My main concerns revolve around planning for scalability for the large tables, and I'll focus on the largest one. The parameters for it are that it's a linking/many-to-many table with basic audit fields for created by and created date (though I am questioning whether those are even necessary for this one). Date/time is not relevant to this; this is an assignment table and applies at all times. Records can get deleted or inserted, not updated, sometimes in bulk, probably not frequently but it can happen at any time. Data cardinality would be relatively high on both foreign keys I think, though I'm not sure what constitutes high cardinality as a ratio to total number of records. For some perspective, the tenant with 180 million records has around 100,000 distinct records for one foreign key and 165,000 for the other. Meanwhile, another client has around 180,000 records, with 500 distinct values in one field and 5000 in the other. So as I said, a lot of variability.
Would the kind of table I described above (billions of rows, high data cardinality, not time based, tenant segmented, bulk insert/deletes at any time) in the kind of scenario I described (400+ tenants with varying amounts of data) be a good candidate for partitioning? The reason I'm concerned about this now is that I've read in a number of places that partitioning is something that can be much less painful to deal with if you plan for it ahead of time rather than try to partition later after the table is huge and harder to work with without requiring down-time or jumping through hoops. At this point, my main concern is not so much querying the data, I tested with a table with 1 billion records and with a proper index select queries run very fast. I'm more worried about concurrency with the read/write/delete, running into blocking because of locks, etc. If partitioning is warranted, what would a good strategy be? Partition by tenant? Just partition large ones and keep smaller ones bundled together?
Given that you said that query performance is not an issue, the only reason I can think of to consider partitioning is to make mass purging easier to accomplish.
Do you have contractual or legal retention policies in place?
The most common scenario would be using time periods as your partition key so that rolling-off old data is simply a matter of dropping partitions, but since you clearly state that date/time is not relevant, I do not see how that would help.
Is it common for you to roll-on/roll-off individual customers? Is there a purging or retention requirement? If so, then partitioning by customer, no matter how imbalanced the partitions would be, would make sense since you could purge a large customer's data without affecting other customers' access to their data.
As for any concurrency issues, partitioning by customer should help contain these problems within a specific customer that is showing heavy activity.
I recommend testing this thoroughly for a few reasons:
I have not seen multiple active partitions in action because I have worked only with time series partitions
I have not looked deeply into PostgreSQL 12's foreign key enhancements and wonder whether a foreign key with a partitioned table on both sides would complicate dropping partitions
I have never explored the practical limits of the number of partitions a database could contain
I may be reading things from my experience into your question about partitioning, but have you considered a schema per customer?
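Partitioning by tenant, with large tenants in their own partitions and the small ones bundled together, might look like the following sketch. It assumes PostgreSQL declarative LIST partitioning with a DEFAULT partition (available from version 11); the table name, column names, and tenant id 42 are all hypothetical.

```sql
-- Hypothetical linking table; names and types are assumptions, not the asker's schema.
CREATE TABLE assignment (
    tenant_id  integer     NOT NULL,
    left_id    bigint      NOT NULL,
    right_id   bigint      NOT NULL,
    created_by integer,
    created_at timestamptz DEFAULT now(),
    -- A primary key on a partitioned table must include the partition key.
    PRIMARY KEY (tenant_id, left_id, right_id)
) PARTITION BY LIST (tenant_id);

-- A large tenant gets a dedicated partition.
CREATE TABLE assignment_t42 PARTITION OF assignment
    FOR VALUES IN (42);

-- All remaining (smaller) tenants share the default partition.
CREATE TABLE assignment_small PARTITION OF assignment DEFAULT;

-- Rolling off the large tenant: detach its partition, then drop it.
ALTER TABLE assignment DETACH PARTITION assignment_t42;
DROP TABLE assignment_t42;
```

Dropping a detached partition removes that tenant's data without deleting rows one by one from a shared table. One caveat to test: a dedicated partition cannot be created for a tenant whose rows already sit in the default partition until those rows have been moved out.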

Do you need to add an index on a partitioned table (postgres 11)?

My team is looking at moving our non partitioned table with ~1TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. partition for every day), would this act in a similar way to an index?
We would only be doing queries on a maximum resolution of one day.
I am reluctant to add an index as we've tried this in the past and it never completed (probably because we didn't turn off writes. Not really an option to turn off writes for an extended period).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12, you can probably go up to a few thousand, in earlier versions you will get performance problems earlier on.
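A minimal sketch of daily range partitioning without an index on the partition key. The table and column names are hypothetical; only the idea of a timestamp partition key and one-day partitions comes from the question.

```sql
-- Hypothetical table; no index is created on created_at.
CREATE TABLE events (
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

-- One partition per day; the upper bound is exclusive.
CREATE TABLE events_2020_01_01 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2020-01-02');
CREATE TABLE events_2020_01_02 PARTITION OF events
    FOR VALUES FROM ('2020-01-02') TO ('2020-01-03');

-- The plan should show a sequential scan of events_2020_01_01 only;
-- the other partition is pruned because its bounds cannot match.
EXPLAIN
SELECT count(*)
FROM events
WHERE created_at >= '2020-01-01' AND created_at < '2020-01-02';

-- Mass deletion along a partition boundary is a metadata operation.
DROP TABLE events_2020_01_02;
```

Queries that filter on other columns still scan the whole day's partition, so this only pays off when a full day is the finest resolution needed, as stated in the question.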

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the following fields:
id_variable [int] (PK),
id_departamento [int] (PK),
fecha [date] (PK),
valor [number].
So, I want to get the minimum, maximum and the average of valor grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query? Should I create a new index on id_variable only, or is the default index sufficient for this query?
Thanks!
Since there is an avg(), and one needs all the values to compute an average, it's going to read the whole table. Unless you use a WHERE, but there is no WHERE, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there were, say, 50 columns, or TEXTs which make the table file huge. In that case reading the whole table just to average a few ints would need to grind in tons of useless stuff from disk.
I mean, covering indexes are awesome when you want to snipe one column or two out of a huge table, and keep the small column set in cache. But this is not the case here, you only got small columns, so this reason is out.
...and of course slightly slower UPDATEs since the index needs to be updated. Also, the index needs to be cached, it's going to use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash-aggregate, which is super fast anyway, it is not so useful.
Now, if you have relatively few distinct values of id_variable... say, enough to fit into a hash-aggregate, which can be a sizable amount, depends on your work_mem... then it'll be difficult to beat it...
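To see which strategy the planner actually picks for this table, the plan can be inspected directly. This is a sketch using the question's table; the id_variable column is added to the select list so that each result row is labelled.

```sql
-- Memory available to each hash or sort operation before it spills to disk.
SHOW work_mem;

-- Runs the query and reports the actual plan. Look at the aggregation node:
-- HashAggregate means no sort was needed; GroupAggregate means the input
-- arrived sorted (from an index scan or an explicit Sort node).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id_variable, AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
```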
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
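A sketch of the materialized-view option using PostgreSQL's built-in feature. Note the assumption: a built-in materialized view is recomputed as a whole on REFRESH rather than kept up to date on each insert, so it corresponds to the "stale but cached" variant; maintaining a summary table on each insert would need triggers instead. The view and index names are hypothetical.

```sql
-- Precomputed statistics per id_variable.
CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       AVG(valor) AS avg_valor,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor
FROM medicion
GROUP BY id_variable;

-- A unique index is required for REFRESH ... CONCURRENTLY.
CREATE UNIQUE INDEX medicion_stats_id_variable_idx
    ON medicion_stats (id_variable);

-- Recompute the statistics; readers keep seeing the old contents
-- until the refresh has finished.
REFRESH MATERIALIZED VIEW CONCURRENTLY medicion_stats;

-- Reading the statistics is now a small index lookup.
SELECT * FROM medicion_stats WHERE id_variable = 1;
```

How often to refresh is the tradeoff described above: each refresh still reads the whole table.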

designing table for high insert rate

I would like some suggestions on how to design a table that gets around 10 to 50 million inserts a day and needs to respond quickly to selects. Should I use indexes, or would the overhead cost be too great?
Edit: I'm not worried about the transaction volume. This is actually an assignment, and I need to figure out a design for a table that "must respond very well to selects not based on the primary key, knowing that this table will receive an enormous amount of inserts day-in-day-out"
Definitely. At least the primary key, foreign keys, and then whatever you need for reporting, just don't overdo it. 10k-50k inserts a day is not a problem. If it was like, I don't know, a million inserts then you could start thinking of having separate tables, data dictionaries and what not, but for your needs I wouldn't worry.
Even if you did 50,000 per day and your day was an 8 hour work day, that would still be less than two inserts per second on average. I suppose you might get peaks that are much higher than that, but in general, SQL Server can handle much higher transaction rates than what you seem to have.
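The per-second figure above can be checked with a one-line calculation, shown here next to the same calculation for the upper bound of 50 million inserts per day stated in the question, spread over a full 24 hours. This is a sketch in PostgreSQL syntax; the column aliases are made up for illustration.

```sql
SELECT round(50000    / (8  * 3600.0), 2) AS per_second_50k_in_8h,   -- 1.74
       round(50000000 / (24 * 3600.0), 1) AS per_second_50m_in_24h;  -- 578.7
```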
If your table is fairly wide (lots of columns or a few really long ones) then you might want to consider clustering by a surrogate (IDENTITY) column. Your volumes aren't enough to make for a bad hot-spot at the end of the table. In combination with this, use indexes for any keys needed for data consistency (i.e. FK's) and retrieval (PK, natural key, etc). Be careful about setting the fill factor on your indexes and consider rebuilding them during a periodic down-time window.
If your table is fairly narrow, then you could possibly consider clustering on the natural key, but you'll have to make sure that your response time expectations can be met.
The best insert rate comes from a PK sorted in the same order as the inserts and no other indexes. 10-50 thousand a day is not that much. If there are only inserts then I don't see any downside to dirty reads.
If you are optimizing for select then use row level locking for inserts.
Measure index fragmentation. Defragment the indexes on a regular basis with a proper fill factor. The fill factor determines how fast the indexes fragment and how often you need to defragment.

Do more than a few dozen partitions not make sense?

I store time-series simulation results in PostgreSQL.
The db schema is like this.
table SimulationInfo (
simulation_id integer primary key,
simulation_property1,
simulation_property2,
....
)
table SimulationResult ( // The size of one row would be around 100 bytes
simulation_id integer,
res_date Date,
res_value1,
res_value2,
...
res_value9,
primary key (simulation_id, res_date)
)
I usually query data based on simulation_id and res_date.
I partitioned the SimulationResult table into 200 sub-tables based on ranges of simulation_id. A fully filled sub-table has 10 to 15 million rows. Currently about 70 sub-tables are fully filled, and the database size is more than 100 GB. All 200 sub-tables will be filled soon, and when that happens, I will need to add more sub-tables.
But I read this answer, which says that more than a few dozen partitions does not make sense. So my questions are as follows.
Do more than a few dozen partitions not make sense? Why?
I checked the execution plan on my 200 sub-tables, and it scans only the relevant sub-table. So I guessed that more partitions, each with a smaller sub-table, must be better.
If the number of partitions should be limited, say to 50, is it no problem to have billions of rows in one table? How big can one table get without major problems, given a schema like mine?
It's probably unwise to have that many partitions, yes. The main reason to have partitions at all is not to make indexed queries faster (which they are not, for the most part), but to improve performance for queries that have to sequentially scan the table based on constraints that can be proved to not hold for some of the partitions; and to improve maintenance operations (like vacuum, or deleting large batches of old data which can be achieved by truncating a partition in certain setups, and such).
Maybe instead of using ranges of simulation_id (which means you need more and more partitions all the time), you could partition using a hash of it. That way all partitions grow at a similar rate, and there's a fixed number of partitions.
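A sketch of that idea using declarative hash partitioning, which assumes PostgreSQL 11 or later; on older versions the same effect needs inheritance with hand-written constraints. The table name, the modulus of 4, and the type of res_value1 are assumptions made for illustration.

```sql
CREATE TABLE simulation_result (
    simulation_id integer NOT NULL,
    res_date      date    NOT NULL,
    res_value1    double precision,  -- assumed type; the question does not give one
    PRIMARY KEY (simulation_id, res_date)
) PARTITION BY HASH (simulation_id);

-- A fixed set of partitions; each row goes to the partition whose
-- remainder matches the hash of simulation_id modulo 4.
CREATE TABLE simulation_result_p0 PARTITION OF simulation_result
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE simulation_result_p1 PARTITION OF simulation_result
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE simulation_result_p2 PARTITION OF simulation_result
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE simulation_result_p3 PARTITION OF simulation_result
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- An equality condition on simulation_id lets the planner prune
-- down to a single partition.
EXPLAIN
SELECT *
FROM simulation_result
WHERE simulation_id = 12345 AND res_date = '2020-01-01';
```

New simulation ids spread across the existing partitions, so no partitions have to be added as the data grows.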
The problem with too many partitions is that the system is not prepared to deal with locking too many objects, for example. Maybe 200 work fine, but it won't scale well when you reach a thousand and beyond (which doesn't sound that unlikely given your description).
There's no problem with having billions of rows per partition.
All that said, there are obviously particular concerns that apply to each scenario. It all depends on the queries you're going to run, and what you plan to do with the data long-term (i.e. are you going to keep it all, archive it, delete the oldest, ...?)