designing table for high insert rate - tsql

I would like some suggestions on how to design a table that gets around 10 to 50 thousand inserts a day and needs to respond quickly to selects. Should I use indexes, or would the overhead cost be too great?
edit: I'm not worried about the transaction volume... this is actually an assignment... and I need to figure out a design for a table that "must respond very well to selects not based on the primary key, knowing that this table will receive an enormous amount of inserts day-in-day-out".

Definitely. At least the primary key, the foreign keys, and then whatever you need for reporting; just don't overdo it. 10k-50k inserts a day is not a problem. If it were, say, a million inserts, then you could start thinking about separate tables, data dictionaries and whatnot, but for your needs I wouldn't worry.

Even if you did 50,000 per day and your day was an 8 hour work day, that would still be less than two inserts per second on average. I suppose you might get peaks that are much higher than that, but in general, SQL Server can handle much higher transaction rates than what you seem to have.
If your table is fairly wide (lots of columns or a few really long ones) then you might want to consider clustering by a surrogate (IDENTITY) column. Your volumes aren't enough to make for a bad hot-spot at the end of the table. In combination with this, use indexes for any keys needed for data consistency (i.e. FK's) and retrieval (PK, natural key, etc). Be careful about setting the fill factor on your indexes and consider rebuilding them during a periodic down-time window.
If your table is fairly narrow, then you could possibly consider clustering on the natural key, but you'll have to make sure that your response time expectations can be met.
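A minimal T-SQL sketch of the layout described above (clustered surrogate IDENTITY key, indexes only on the keys you actually look up, explicit fill factor); every table, column, and constraint name is invented for illustration, and dbo.Customer is a hypothetical parent table:

-- Hypothetical wide event table clustered on a surrogate IDENTITY key.
CREATE TABLE dbo.OrderEvent
(
    OrderEventId bigint IDENTITY(1,1) NOT NULL,
    CustomerId   int            NOT NULL,
    EventCode    varchar(20)    NOT NULL,
    CreatedAt    datetime2(3)   NOT NULL CONSTRAINT DF_OrderEvent_CreatedAt DEFAULT SYSUTCDATETIME(),
    Payload      nvarchar(2000) NULL,
    CONSTRAINT PK_OrderEvent PRIMARY KEY CLUSTERED (OrderEventId),
    CONSTRAINT FK_OrderEvent_Customer FOREIGN KEY (CustomerId) REFERENCES dbo.Customer (CustomerId)
);

-- Secondary indexes only for the lookups you actually run; a slightly lower
-- fill factor leaves room so inserts cause fewer page splits between rebuilds.
CREATE NONCLUSTERED INDEX IX_OrderEvent_Customer
    ON dbo.OrderEvent (CustomerId, CreatedAt)
    WITH (FILLFACTOR = 90);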

The best insert rate comes from a PK sorted the same as the insert order and no other indexes. 10-50 thousand a day is not that much. If the workload is insert-only, then I don't see any downside to dirty reads.
If you are optimizing for selects, then use row-level locking for inserts.
Measure index fragmentation. Defragment the indexes on a regular basis with a proper fill factor. The fill factor determines how fast the indexes fragment and therefore how often you need to defragment.
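A hedged T-SQL sketch of the "measure fragmentation, defragment on a schedule" advice, reusing the hypothetical dbo.OrderEvent table from the example above:

-- Check fragmentation of all indexes on the table.
SELECT i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID(N'dbo.OrderEvent'), NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id;

-- Light fragmentation: reorganize. Heavy fragmentation: rebuild with the chosen
-- fill factor, ideally during a quiet window.
ALTER INDEX ALL ON dbo.OrderEvent REORGANIZE;
-- ALTER INDEX ALL ON dbo.OrderEvent REBUILD WITH (FILLFACTOR = 90);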

Related

How do unique composite keys affect the performance of Postgres / TimescaleDB?

To make it short:
I have a TimescaleDB table with multiple columns:
from
to
code
point
...
There are a few more columns.
I need to keep the rows unique, so that the combination of the above columns (from, to, code, point) is unique. The individual columns can contain repeated values, but the combination must be unique.
How would that affect performance? There would be relatively many insertions (a few hundred per minute).
EDIT:
The project is still relatively new, so we can make changes to the database. There are alternatives that I can consider. I also do not know whether there will only be a few hundred insertions per minute or whether the volume will scale over time. I know the question leaves room for debate, and I am grateful for every answer, but I asked how composite keys affect performance so I can make an informed decision.
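For reference, a sketch of how the composite uniqueness could be declared, using the column names from the question ("from" and "to" must be quoted because they are reserved words); the table definition and the hypertable call are assumptions, since the question shows no DDL. Keep in mind that TimescaleDB requires unique indexes on a hypertable to include the partitioning column, which holds here if the hypertable is partitioned on "from":

-- Hypothetical table; only (from, to, code, point) come from the question.
CREATE TABLE measurement (
    "from" timestamptz NOT NULL,
    "to"   timestamptz NOT NULL,
    code   text        NOT NULL,
    point  text        NOT NULL,
    value  double precision
);

-- Make it a hypertable partitioned on "from".
SELECT create_hypertable('measurement', 'from');

-- The composite uniqueness asked about; it becomes a unique index that every
-- insert has to check, which is the main write-side cost.
ALTER TABLE measurement
    ADD CONSTRAINT measurement_uniq UNIQUE ("from", "to", code, point);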

Do you need to add an index on a partitioned table (postgres 11)?

My team is looking at moving our non partitioned table with ~1TB of data over to a partitioned table.
We would be using range partitioning based on a timestamp column.
One thing I don't understand is whether we need to add an index on the timestamp column if it's being used as the partition key. If we make our partitions quite small (e.g. partition for every day), would this act in a similar way to an index?
We would only be doing queries on a maximum resolution of one day.
I am reluctant to add an index, as we've tried this in the past and it never completed (probably because we didn't turn off writes; turning off writes for an extended period isn't really an option).
Your feeling is right: omitting the index on the partitioning column is one of the few places where partitioning actually makes queries faster.
You can then get away with a sequential scan of a single partition, and you don't have to maintain the index with every data modifying statement.
The other advantage is that partitioning makes mass deletion of data (along the partition boundaries) so much more efficient. And finally, autovacuum's job will become easier.
Two points about partitioning:
Upgrade to v12; there have been substantial performance improvements that concern partitioning.
Don't use too many partitions. With v12 you can probably go up to a few thousand; in earlier versions you will run into performance problems sooner.
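A sketch of what that layout could look like with declarative range partitioning and no index on the timestamp; the table name, column names, and dates are placeholders:

CREATE TABLE event_log (
    event_time timestamp NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (event_time);

-- One partition per day: a query constrained to a single day is pruned to one
-- partition, which is then sequentially scanned without any index on event_time.
CREATE TABLE event_log_2021_01_01 PARTITION OF event_log
    FOR VALUES FROM ('2021-01-01') TO ('2021-01-02');
CREATE TABLE event_log_2021_01_02 PARTITION OF event_log
    FOR VALUES FROM ('2021-01-02') TO ('2021-01-03');

-- Mass deletion along partition boundaries is then just:
-- DROP TABLE event_log_2021_01_01;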

Optimizing aggregation function and ordering in PostgreSQL

I have the following table 'medicion' with the following fields:
id_variable[int](PK),
id_departamento[int](PK),
fecha [date](PK),
valor [number].
I want to get the minimum, maximum and average of valor, grouping all that data by id_variable. So my query is:
SELECT AVG(valor), MIN(valor), MAX(valor)
FROM medicion
GROUP BY id_variable;
Knowing that by default PostgreSQL builds an index for the primary key
(id_departamento, id_variable, fecha)
how can I optimize this query? Should I create a new index on id_variable only, or is the default index valid for this query?
Thanks!
Since there is an avg() and one needs all the values to compute an average, the query is going to read the whole table, unless you use a WHERE clause; but there is no WHERE here, so I presume you want global statistics.
The only things an extra covering index brings are:
Not reading the entire table.
This could be beneficial if there were, say, 50 columns, or TEXT columns that make the table file huge. In that case, reading the whole table just to average a few ints would mean grinding through tons of useless data from disk.
I mean, covering indexes are awesome when you want to snipe one or two columns out of a huge table and keep that small column set in cache. But that is not the case here: you only have small columns, so this reason is out.
...and of course UPDATEs become slightly slower since the index needs to be updated. Also, the index needs to be cached, it's going to use some RAM, etc.
Getting the rows pre-sorted for convenient aggregation.
This can matter here, mostly if it avoids a huge sort. However, if it only avoids a hash aggregate, which is super fast anyway, it is not so useful.
Now, if you have relatively few distinct values of id_variable... say, few enough to fit into a hash aggregate, which can be a sizable number depending on your work_mem... then it will be difficult to beat it...
If the table is not updated often, or is insert-only, and you need the statistics often, consider a materialized view (keep min/max/avg for each id_variable in a separate table, and keep them updated on each insert). Updating the mat-view takes time, so this is a tradeoff if you need the stats very often.
You could keep your stats in cache if you don't mind them being stale.
Or, if your table has tons of old data, you could partition it, and keep the min/max/sum/count for the old read-only partition, and only compute the stats on the new stuff.
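A sketch of the two options discussed above, using the table and column names from the question (the index and view names are made up):

-- Option 1: an index on (id_variable, valor) can feed the aggregate via an
-- index-only scan and delivers the rows pre-sorted by id_variable.
CREATE INDEX medicion_id_variable_valor_idx ON medicion (id_variable, valor);

-- Option 2: precompute the statistics and accept that they are stale between refreshes.
CREATE MATERIALIZED VIEW medicion_stats AS
SELECT id_variable,
       AVG(valor) AS avg_valor,
       MIN(valor) AS min_valor,
       MAX(valor) AS max_valor
FROM medicion
GROUP BY id_variable;

-- REFRESH MATERIALIZED VIEW medicion_stats;  -- rerun after bulk loads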

Slow select from one billion rows GreenPlum DB

I've created the following table on GreenPlum:
CREATE TABLE data."CDR"
(
mcc text,
mnc text,
lac text,
cell text,
from_number text,
to_number text,
cdr_time timestamp without time zone
)
WITH (
OIDS = FALSE,appendonly=true, orientation=column,compresstype=quicklz, compresslevel=1
)
DISTRIBUTED BY (from_number);
I've loaded one billion rows into this table, but every query runs very slowly.
I need to run queries on all fields (not only one).
What can I do to speed up my queries?
Should I use PARTITION? Indexes?
Maybe a different DB like Cassandra or Hadoop?
This highly depends on the actual queries you are doing and what your hardware setup looks like.
Since you are querying all the fields, the selectivity gained by columnar orientation is probably hurting you more than helping, as you need to scan all the data anyway. I would remove the columnar orientation.
Generally speaking indexes don't help in a Greenplum system. Usually the amount of hardware that is involved tends to make scanning the data directory faster than doing index lookups.
Partitioning could be a great help but there would need to be a better understanding of the data. You are probably accessing specific time intervals so creating a partitioning scheme around cdr_time could eliminate the scan of data not needed for the result. The last thing I would worry about is indexes.
Your distribution by from_number could have an impact on query speed. The system will hash the data based on from_number so if you are querying selectively on the from_number the data will only be returned by the node that has it and you won't be leveraging the parallel nature of the system and spreading the request across all of the nodes. Unless you are joining to other tables on from_number, which allows the joins to be collocated and performed within the node, I would change that to be distributed RANDOMLY.
On top of all of that there is the question of what the hardware is and if you have a proper amount of segments setup and resources to feed them. Essentially every segment is a database. Good hardware can handle multiple segments per node, but if you are doing this on a light hardware you need to find the sweet spot where number of segments matches what the underlying system can provide.
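Putting those suggestions together (row orientation, random distribution, range partitions on cdr_time), a sketch in classic Greenplum partition syntax; the new table name, date range, and monthly interval are placeholders to adjust to the real data:

CREATE TABLE data."CDR_partitioned"
(
    mcc text,
    mnc text,
    lac text,
    cell text,
    from_number text,
    to_number text,
    cdr_time timestamp without time zone
)
WITH (appendonly=true, compresstype=quicklz, compresslevel=1)  -- row-oriented: no orientation=column
DISTRIBUTED RANDOMLY
PARTITION BY RANGE (cdr_time)
(
    START ('2017-01-01'::timestamp) INCLUSIVE
    END   ('2018-01-01'::timestamp) EXCLUSIVE
    EVERY (INTERVAL '1 month')
);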
#Dor,
I have the same type of data, where CDR info is stored for a telecom company: 10-12 million rows are inserted daily, and heavy queries run on those CDR-related tables. I was facing the same issue last year, and I created partitions on those tables on the CDR timing column.
As per my understanding, GP creates physical tables for each partition, whereas other RDBMSs create logical tables. After this I got better performance with all SELECTs on these tables. I also think you should convert the text datatype to character varying for all columns (if text is really not required); I found DB operations on text fields to be very slow (especially ORDER BY and GROUP BY).
Whether an index will help depends on your queries; in my case I have huge inserts, so I haven't tried one yet.
If you are selecting all the columns, there is no need for a column-oriented table.
Regards

Billions of rows in PostgreSQL: to partition or not to partition?

What I have:
A simple server with one Xeon with 8 logical cores, 16 GB RAM, and an mdadm RAID1 of two 7200 rpm drives.
PostgreSQL
A lot of data to work with. Up to 30 million rows are being imported per day.
Time: complex queries can take up to an hour to execute.
Simplified schema of table, that will be very big:
id| integer | not null default nextval('table_id_seq'::regclass)
url_id | integer | not null
domain_id | integer | not null
position | integer | not null
The problem with the schema above is that I don't have the exact answer on how to partition it.
Data for all periods is going to be used (NO queries will have date filters).
I thought about partitioning on "domain_id" field, but the problem is that it is hard to predict how many rows each partition will have.
My main question is:
Does it make sense to partition the data if I don't use partition pruning and I am not going to delete old data?
What would be the pros/cons of that?
How will my import speed degrade if I don't do partitioning?
Another question related to normalization:
Should url be extracted into another table?
Pros of normalization
Table is going to have rows with average size of 20-30 bytes.
Joins on "url_id" are supposed to be much faster than on "url" field
Pros of denormalization
Data can be imported much, much faster, as I don't have to do a lookup in the "url" table before each insert.
Can anybody give me any advice? Thanks!
Partitioning is most useful if you are going to either have selection criteria in most queries which allow the planner to skip access to most of the partitions most of the time, or if you want to periodically purge all rows that are assigned to a partition, or both. (Dropping a table is a very fast way to delete a large number of rows!) I have heard of people hitting a threshold where partitioning helped keep indexes shallower and therefore boosted performance; but really that gets back to the first point, because you effectively move the first level of the index tree to another place -- it still has to happen.
On the face of it, it doesn't sound like partitioning will help.
Normalization, on the other hand, may improve performance more than you expect; by keeping all those rows narrower, you can get more of them into each page, reducing overall disk access. I would do proper 3rd normal form normalization, and only deviate from that based on evidence that it would help. If you see a performance problem while you still have disk space for a second copy of the data, try creating a denormalized table and seeing how performance is compared to the normalized version.
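A sketch of the normalized variant under discussion, using the column names from the simplified schema (the table names and the upsert shown in the comment are assumptions; ON CONFLICT needs PostgreSQL 9.5+):

CREATE TABLE url (
    url_id serial PRIMARY KEY,
    url    text NOT NULL UNIQUE
);

CREATE TABLE ranking (
    id        serial  PRIMARY KEY,
    url_id    integer NOT NULL REFERENCES url (url_id),
    domain_id integer NOT NULL,
    position  integer NOT NULL
);

-- The import-time lookup that denormalization avoids, as a single statement:
-- INSERT INTO url (url) VALUES ('http://example.com/page')
--     ON CONFLICT (url) DO UPDATE SET url = EXCLUDED.url
--     RETURNING url_id;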
I think it makes sense, depending on your use cases. I don't know how far back in time your 30B row history goes, but it makes sense to partition if your transactional database doesn't need more than a few of the partitions you decide on.
For example, partitioning by month makes perfect sense if you only query for two months' worth of data at a time. The other ten months of the year can be moved into a reporting warehouse, keeping the transactional store smaller.
There are restrictions on the fields you can use in the partition. You'll have to be careful with those.
Get a performance baseline, do your partition, and remeasure to check for performance impacts.
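If that two-months-at-a-time access pattern applies, a sketch of monthly partitioning in current declarative syntax (PostgreSQL 10+); the imported_at column and all table names and dates are assumptions, since the simplified schema has no date column:

CREATE TABLE ranking_by_month (
    id          bigserial,
    url_id      integer NOT NULL,
    domain_id   integer NOT NULL,
    position    integer NOT NULL,
    imported_at date    NOT NULL
) PARTITION BY RANGE (imported_at);

CREATE TABLE ranking_2021_05 PARTITION OF ranking_by_month
    FOR VALUES FROM ('2021-05-01') TO ('2021-06-01');
CREATE TABLE ranking_2021_06 PARTITION OF ranking_by_month
    FOR VALUES FROM ('2021-06-01') TO ('2021-07-01');

-- When a month falls out of the transactional window, detach it and move it to
-- the reporting warehouse:
-- ALTER TABLE ranking_by_month DETACH PARTITION ranking_2021_05;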
With the given amount of data in mind, you'll mostly be waiting on IO. If possible, perform some tests with different HW configurations, trying to get the best IO figures for your scenarios. IMHO, 2 disks will not be enough after a while, unless there's something else behind the scenes.
Your table will be growing daily at a known rate. And most likely it will be queried daily. As you haven't mentioned data being purged (if it will be, then do partition it), this means that queries will run slower each day. At some point you'll start looking at how to optimize your queries. One of the possibilities is to parallelize the query at the application level. But for that, some conditions should be met:
your table should be partitioned in order to parallelize queries;
HW should be capable of delivering the requested amount of IO in N parallel streams.
All answers should be given by the performance tests of different setups.
And as others mentioned, there are more benefits for the DBA in partitioned tables, so I personally would go for partitioning any table that is expected to receive more than 5M rows per interval, be it a day, week or month.
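If application-level parallelism over partitions is the goal and the distribution of domain_id is unpredictable, one option (my assumption, not something the answers above prescribe) is hash partitioning, available from PostgreSQL 11, which gives roughly even partition sizes that separate workers can scan in parallel:

CREATE TABLE ranking_by_domain (
    id        bigserial,
    url_id    integer NOT NULL,
    domain_id integer NOT NULL,
    position  integer NOT NULL
) PARTITION BY HASH (domain_id);

CREATE TABLE ranking_by_domain_p0 PARTITION OF ranking_by_domain
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
CREATE TABLE ranking_by_domain_p1 PARTITION OF ranking_by_domain
    FOR VALUES WITH (MODULUS 8, REMAINDER 1);
-- ...and so on for remainders 2 through 7.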