I am developing an application using a virtual private database pattern in Postgres.
Every user gets an id, and every row belonging to that user carries this id to keep the data separated from other users' data. This user id should also be part of the primary key. In addition, every row has to have an id that is unique within the scope of the user; this id will be the other part of the primary key.
If we have to scale this across multiple servers, we can also append a third column to the PK identifying the shard the id was generated on.
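A minimal sketch of that layout (table and column names are made up for illustration):
CREATE TABLE user_items (
    user_id integer NOT NULL,   -- id of the owning user
    item_id integer NOT NULL,   -- unique per user, not globally
    payload text,
    PRIMARY KEY (user_id, item_id)
    -- a shard_id column could be appended as a third PK part later
);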
My question now is how to create per-user unique ids. I came up with some options, but I am not sure about all of their implications. The two solutions that seem most promising to me are:
Creating one sequence per user:
This can be done automatically, using a trigger, every time a user is created. It is certainly transaction safe, and I think it should be quite OK in terms of performance.
What I am worried about is that this has to work for a lot of users (100k+), and I don't know how Postgres will deal with 100k+ sequences. I tried to find out how sequences are implemented, but without luck.
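A rough sketch of such a trigger (the users table, its user_id column and the user_seq_<id> naming are assumptions; on Postgres versions before 11, EXECUTE PROCEDURE replaces EXECUTE FUNCTION):
CREATE OR REPLACE FUNCTION create_user_sequence() RETURNS trigger AS $$
BEGIN
    -- one dedicated sequence per user
    EXECUTE format('CREATE SEQUENCE user_seq_%s', NEW.user_id);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_create_user_sequence
AFTER INSERT ON users
FOR EACH ROW EXECUTE FUNCTION create_user_sequence();

-- per-user ids are then drawn from that user's own sequence, e.g. SELECT nextval('user_seq_42');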
Counter in the user table:
Keep all users in a table with a field holding the latest id handed out for that user.
When a user starts a transaction, I can lock the row in the user table and create a temporary sequence with the latest id from the user table as its starting value. This sequence can then be used to supply ids for new entries.
Before the transaction ends, the current value has to be written back to the user table, and the lock is released.
If another transaction from the same user tries to insert rows concurrently, it will stall until the first transaction releases its lock on the user table.
This way I do not need thousands of sequences, and I don't think there will be frequent concurrent access from a single user (the application has OLTP character, so there will not be long-lasting transactions). Even if that happens, it will just stall for about a second and not hurt anything.
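A sketch of one such transaction (the users table with a last_id column and the temporary sequence name are assumptions; the start value would of course be filled in from the value just read, not hardcoded):
BEGIN;

-- lock the user's row and read the current counter (suppose it returns 1000)
SELECT last_id FROM users WHERE user_id = 42 FOR UPDATE;

-- hand out new per-user ids from a temporary sequence starting just above it
CREATE TEMPORARY SEQUENCE tmp_user_ids START 1001;

INSERT INTO user_items (user_id, item_id, payload)
VALUES (42, nextval('tmp_user_ids'), 'some data'),
       (42, nextval('tmp_user_ids'), 'more data');

-- write the new high-water mark back; the row lock is released on commit
UPDATE users SET last_id = (SELECT last_value FROM tmp_user_ids) WHERE user_id = 42;

DROP SEQUENCE tmp_user_ids;
COMMIT;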
The second part of my question is whether I should just use two columns (or maybe three if the shard_id joins the game) and make them a composite PK, or whether I should pack them together into one column. I think handling will be way easier having them in separate columns, but what does performance look like? Let's assume both values are 32-bit integers: is it better to have two int columns in an index or one bigint column?
thx for all answers,
alex
I do not think sequences would be scalable to the level you want (100k sequences). A sequence is implemented as a relation with just one row in it.
Each sequence will appear in the system catalog (pg_class), which also contains all of the tables, views, etc. Having 100k rows there is sure to slow the system down dramatically. The amount of memory required to hold all of the data structures associated with these sequence relations would also be large.
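You can see this in the catalog yourself; every sequence is its own pg_class entry:
-- sequences are catalog entries just like tables (relkind = 'S')
SELECT count(*) FROM pg_class WHERE relkind = 'S';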
Your second idea seems more practical and, if combined with temporary sequences, might be more scalable.
For your second question, I don't think a composite key would be any worse than a single column key, so I would go with whatever matches your functional needs.
Each user will use this data at least 15 times when they are logged in. So READ is more important.
So I have two approaches. I know this is a rookie question, but I am just confused between the options:
Approach 1
Have multiple rows with fewer columns:
id   data    user
1    task1   1
2    task2   1
3    task3   1
4    task1   7
And Approach 2
Have multiple columns with a single row:
id   task1   task2   task3   user
1    True    True    True    1
2    True    False   False   7
Please suggest which is the better approach; everything is heavily READ based. I will literally be fetching all of this to calculate some permissions and actions, and it will be used on some major routes that users visit often.
I think you're doing some premature optimization here.
It's very rare that a database slows down because of small quick queries like this. What gets you is usually the big search query when it misbehaves or if the indices aren't optimal for the job.
As everyone said, approach 2 is terrible because you need to add columns every time you want to add a new task. That's a typical red flag for a bad design. In addition, if you want to search these columns, you'll also need to add indices on them.
Approach 1 is the usual way, and it works well. The typical problem with this one is when you want to search based on attributes, because you have to join once per attribute, which doesn't optimize well.
In this case however, since you say this will be read at login, I guess this is about storing user rights or tasks associated with users. Perhaps you will select this data and cache it in the session so it only needs to be fetched once at login. So in this case, you should worry more about the queries that occur on every page, rather than the query that only occurs at login.
Anyway, Approach 1 has one gotcha: if the data isn't clustered and the rows for one user sit in different pages of the table file on your disk, then fetching them will need one IO per row. That's not much of a problem with SSDs, but still.
Fortunately, postgres supports two ways of avoiding that: cluster, and index-only scans.
CLUSTER just orders the table on disk in the order of the index you specify. Since you need an index on (user, task) anyway to quickly find whether a user has a task, you can cluster on that index, and all the rows for a user will be in the same place on disk, so only one IO is needed to fetch them. However, CLUSTER locks the table, so it's best to run it during scheduled maintenance. If your table has only a few million rows and you set maintenance_work_mem high enough, it will only take a couple of seconds.
The other way is index-only scans. If you have an index on (user, task) and you run SELECT user, task FROM ... WHERE user = ..., then Postgres can use an index-only scan. The index data is ordered by (user, task), so one IO fetches the page containing the first row, and the following rows for that user are stored right after it in index order, usually on the same page, so they're already loaded and very fast to access.
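Putting the two together, a minimal sketch with assumed names (user_task as the table; user is a reserved word in Postgres, so the columns are called user_id and task_id here):
-- index that serves both CLUSTER and index-only scans
CREATE INDEX user_task_idx ON user_task (user_id, task_id);

-- physically reorder the table along that index (takes a table lock while it runs)
CLUSTER user_task USING user_task_idx;

-- this query can be answered from the index alone (check with EXPLAIN)
SELECT user_id, task_id
FROM user_task
WHERE user_id = 42;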
Notes:
Since you have no other columns, I'll assume (user,task) is unique, because it makes no sense to have duplicates in this case. So that can be your primary key, and you can drop the id and associated index. You don't have to use a sequence on every table if the data gives you a nice natural primary key.
"task" would usually be a foreign key to another table.
I have a legacy, but pretty big (~25 GB), questionably designed database. The "pattern" commonly used across the whole database boils down to the following:
separate logical parts into different tables (journal_1, journal_2, journal_n)
all tables have a unique bigserial/autoincrement field (journal_id_seq_1, journal_id_seq_2, journal_id_seq_n)
all tables have one or several foreign keys to one or several reference tables (the journal tables have two foreign keys; a group of tables with another structure (log_1, log_2, log_n) references just one)
I'm extremely curious (actually close to panic :) about what happens if there are about 50 thousand such tables (right now it is "just" about 15k).
My idea is to gather everything (tables with identical structure) into one big table with a common name (let's say journal), add a journal_id column (extracted from the journal_{1|2|3} suffix), partition the table by this column, and obviously create a partition for each original table with the same naming convention. Moreover, the bigserial fields need to be converted into regular bigints, but I still need to keep a sequence for each partition and manually call nextval on every insert. The primary key also needs to be extended with the journal_id field in addition to seq_id. Finally, I see a bonus in sharding, which can be applied to partitions when the database becomes enormous.
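Roughly, the conversion I have in mind looks like this (declarative partitioning, Postgres 11+; the payload column just stands in for the real journal columns):
-- new partitioned parent table
CREATE TABLE journal (
    journal_id integer NOT NULL,   -- extracted from the old journal_{n} suffix
    seq_id     bigint  NOT NULL,   -- formerly a bigserial in each journal_n
    payload    text,
    PRIMARY KEY (journal_id, seq_id)
) PARTITION BY LIST (journal_id);

-- one partition per old table, same naming convention
CREATE TABLE journal_1 PARTITION OF journal FOR VALUES IN (1);
CREATE TABLE journal_2 PARTITION OF journal FOR VALUES IN (2);

-- keep one sequence per partition and call nextval explicitly on insert
CREATE SEQUENCE journal_id_seq_1;
INSERT INTO journal (journal_id, seq_id, payload)
VALUES (1, nextval('journal_id_seq_1'), 'some row');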
Please share your thoughts about this strategy, and especially about the foreign keys. For now we need max_locks_per_transaction set to at least 512, otherwise pg_dump fails with:
ERROR: out of shared memory
HINT: You might need to increase max_locks_per_transaction.
pg_dump: error: query was: LOCK TABLE
Besides the locking nightmare, as far as I know, Postgres has a limit on relations per database (the total number is huge but not unlimited). Do I need to create foreign keys for each partition table, or will just a part (certain rows) of the partitioned (parent) table be locked on insert, delete or update, since all partitions are just "storage" and not real relational entities?
Thank you in advance.
15K tables == Gag!
Partitioning is not likely to be any better than multiple tables.
Neither provides any performance benefits except in rare cases.
Let's investigate the future need for sharding. That, alone, may justify the existence of journal_nnn. In this, some journals would be on one machine, some on another machine, etc. Are all journals in active use? Or are most of them "old" and not really used?
The PRIMARY KEY can be a composite of two (or more) columns.
AUTO_INCREMENT has some advantages over manually creating "serial numbers". (However, the Question does not have enough details for me to elaborate.)
FOREIGN KEYs are two things: an implied INDEX (good for performance) and a constraint (good for integrity). In a well-debugged app, the integrity checks are unnecessary overhead. They must be abandoned in partitioning and probably in sharding.
Why do you use partitioning for such a small database? Your average table is less than 2 MB in size; that's really, really small.
Get rid of the partitioning and your problems are gone.
Having 50000 tables starts to get painful, and it makes no sense with a small database like this. The same holds for partitioning – after all, partitions are tables with a side job.
I would define only one table per object type.
About the auto-generated primary key numbers: make a combined primary key that consists of the old primary key and the table number (journal_id). For new entries, use a sequence that is initialized higher than the existing maximum of all tables.
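A rough sketch of that suggestion, with assumed names (journal as the merged table, seq_id as the old key column, payload standing in for the rest):
-- combined primary key: old id plus the table number
ALTER TABLE journal ADD PRIMARY KEY (journal_id, seq_id);

-- one shared sequence for new rows, initialized above the maximum of all old ids
CREATE SEQUENCE journal_seq;
SELECT setval('journal_seq', (SELECT max(seq_id) FROM journal));  -- assumes journal is not empty

-- new entries simply draw from that sequence
INSERT INTO journal (journal_id, seq_id, payload)
VALUES (3, nextval('journal_seq'), 'new row');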
Use case
I need to store texts assigned to some entity. It's important to note that I always only care about the most current texts that have been assigned to that entity. When new texts are inserted, older ones might even be deleted. And that "might" is the problem, because I can't rely on only the most current texts being available.
The only thing I'm unsure how to design is the case where an INSERT can provide either 1 or N texts for some entity. In the latter case, I need to know which N texts belong to the most recent INSERT for that entity. Additionally, inserting N texts instead of 1 will be pretty rare.
I know that things could be implemented using two different tables: one generating some main ID and the other mapping individual texts with their own IDs to that main ID. Because multiple texts should happen rarely, and a one-table design already provides columns which could easily be reused for grouping multiple texts together, I prefer using only one table.
Additionally, I thought about which concept would make a good grouping key in general. I somewhat doubt that others really always implement the two-table approach only, and therefore created this question to get a better understanding. Of course I might simply be wrong and everybody avoids such "hacks" at all costs. :-)
Possible keys
Transaction-local timestamp
Postgres supports the concept of a transaction-local timestamp via current_timestamp. I need one of those anyway to store when the texts were saved, so it might be used for grouping as well?
While there's a theoretical chance of collisions, timestamps have a resolution of 1 microsecond, which in practice is enough for my needs. Texts are uploaded by human users, and it is very unlikely that multiple humans upload texts for the same entity at exactly the same time.
That timestamp won't be used as a primary key of course, only to group multiple texts if necessary.
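Sketched with an assumed texts table: current_timestamp is frozen at transaction start, so every row inserted in the same transaction gets the same value and can be grouped by it.
-- all rows of one INSERT (or one transaction) share the same current_timestamp
INSERT INTO texts (entity_id, created_at, body)
VALUES (42, current_timestamp, 'first text'),
       (42, current_timestamp, 'second text');

-- fetch the most recent group of texts for an entity
SELECT body
FROM texts
WHERE entity_id = 42
  AND created_at = (SELECT max(created_at) FROM texts WHERE entity_id = 42);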
Transaction-ID
Postgres supports txid_current to get the ID of the current transaction, which should be ever increasing over the lifetime of the current installation. The good thing is that this value is always available and the app doesn't need to do anything on its own. But things can easily break in case of restores, can't they? Can TXIDs e.g. be reused in a restored cluster?
People who know these things better than I do write the following:
Do not use the transaction ID for anything at the application level. It is an internal system level field. Whatever you are trying to do, it's likely that the transaction ID is not the right way to do it.
https://stackoverflow.com/a/32644144/2055163
You shouldn't be using the transaction ID for anything except an identifier for a transaction. You can't even assume that a lower transaction ID is an older transaction, due to transaction ID wrap-around.
https://stackoverflow.com/a/20602796/2055163
Isn't my grouping a valid use case for wanting to know the ID of the current transaction?
Custom sequence
Grouping simply needs a unique key per transaction, which can be achieved with a custom sequence used for that purpose only. I don't see any downsides; its values consume less storage than e.g. UUIDs and can easily be queried.
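A minimal sketch of this option (the sequence, the texts table and its group_id column are assumed names); nextval is called once per upload and its value is reused for every text of that upload:
CREATE SEQUENCE text_group_seq;

-- one group id per upload, shared by all texts inserted together
WITH grp AS (SELECT nextval('text_group_seq') AS group_id)
INSERT INTO texts (entity_id, group_id, body)
SELECT 42, grp.group_id, t.body
FROM grp,
     unnest(ARRAY['first text', 'second text']) AS t(body);
The most recent group for an entity is then simply the one with the highest group_id.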
Reusing first unique ID
The table storing the texts contains a serial column, so each inserted text already gets an individual ID. Therefore, the ID of the first inserted text could simply be reused as the group key for all texts added afterwards.
When only one text is inserted, one can easily use currval and doesn't even need to explicitly query the ID of the inserted row. With multiple texts this no longer works, though, because currval returns the latest ID rather than the first one of the transaction, so some special handling would be necessary.
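One way that special handling could look, as a sketch (the returned id 1234 is purely illustrative; the first row can either get its own id written into group_id as well, or be matched via COALESCE(group_id, id) when querying):
BEGIN;

-- first text: its generated id doubles as the group key (say it comes back as 1234)
INSERT INTO texts (entity_id, body)
VALUES (42, 'first text')
RETURNING id;

-- the remaining texts reuse that id as group_id
INSERT INTO texts (entity_id, group_id, body)
VALUES (42, 1234, 'second text'),
       (42, 1234, 'third text');

COMMIT;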
APP-generated random UUID
Each request to store multiple texts could simply generate some UUID and group by that. The mainly used database, Postgres, even supports a corresponding data type.
I mainly see two downsides with this: it already feels really hacky, and it consumes more space than necessary. OTOH, compared to the texts being stored, the latter might simply be negligible.
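For completeness, a sketch of this variant; gen_random_uuid() is built in since Postgres 13 (available via the pgcrypto extension before that), and the UUID could just as well be generated by the application. The group_uuid column is an assumption:
-- one UUID per upload, generated once and reused for every text
WITH grp AS (SELECT gen_random_uuid() AS group_uuid)
INSERT INTO texts (entity_id, group_uuid, body)
SELECT 42, grp.group_uuid, t.body
FROM grp,
     unnest(ARRAY['first text', 'second text']) AS t(body);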
I'm using aspnet-core and ef-core with sql server. I have an 'order' entity. As I'm expecting the orders table to be large, and the most frequent query would fetch only the active orders for a certain customer (active orders are just a tiny fraction of the whole table), I'd like to optimize the speed of that query, but I can't decide between these two approaches:
1) I don't know if this is possible, as I haven't done it before, but I was thinking about creating a Boolean column named 'IsActive' and indexing it, so that querying only active orders would be faster.
2) When an order becomes inactive, move it to another table, i.e. HistoricalOrders, thus keeping the orders table small.
Which of the two would give better results? Or is neither a good solution, and a third approach could be suggested?
If you want to partition away cold data, then a leading boolean index column is a valid way to do that. That column must be added to all indexes that you want to hot/cold partition, including the clustered index, which is quite awkward. The query optimizer requires a dummy predicate such as IsActive IN (0, 1) in the WHERE clause to still be able to seek on such indexes. Of course, that will now also touch the cold data, so you probably need to know the concrete value of IsActive, or try the 1 value first and be sure that it matches 99% of the time.
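A rough T-SQL sketch of that setup (table, column and index names are invented; @CustomerId stands for a query parameter):
-- IsActive leads every index that should be hot/cold partitioned
CREATE INDEX IX_Orders_IsActive_CustomerId
    ON dbo.Orders (IsActive, CustomerId);

-- when the query does not know IsActive, the dummy predicate still allows
-- index seeks (one per IsActive value), but it touches the cold range too
SELECT OrderId, OrderDate
FROM dbo.Orders
WHERE IsActive IN (0, 1)
  AND CustomerId = @CustomerId;

-- when only hot data is wanted, seek straight into the IsActive = 1 range
SELECT OrderId, OrderDate
FROM dbo.Orders
WHERE IsActive = 1
  AND CustomerId = @CustomerId;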
Depending on the schema this can be impractical. I have never seen a good case for this but I'm sure it exists.
A different way to do that is to use partitioning. Here, the query optimizer is used to probing multiple partitions anyway but again you don't want it to probe cold data. Even if it does not find anything this will pull pages into memory making the partitioning moot.
The historical table idea (e.g. HistoricalOrders) is the same thing in different clothes.
So in order to make this work you need:
Modify all indexes that you care about (likely all), or partition, or create a history table.
Have a way to almost never need to probe the cold partition.
I think (2) kills it for most cases.
Among the 3 solutions I'd probably pick the indexing solution because it is simplest. If you are worried about people making mistakes by writing bad queries all the time, I'd pick a separate table. That makes mistakes hard but makes the code quite awkward.
Note that many indexes are already naturally partitioned. Indexes on the identity column or on an increasing datetime column are hot at the end and cold elsewhere. An index on (OrderStatus INT, CreateDateTime datetime2) would have one hot spot per order status and be cold otherwise. So those are already solved.
Some further discussion.
Before thinking about a new HistoricalOrders table, just create a column named IsActive and test it with your data. You don't need to make it an index column, because indexes eat up storage and slow down writes and updates, so we must be very careful when creating an index. When you query the data, do it as shown below. In the query below, the data selection (filtering) is done on the SQL Server side (IQueryable), so it is very fast.
Note: use AsNoTracking() too. It will boost the performance as well.
var activeOrders = _context.Set<Orders>()
    .Where(o => o.IsActive)
    .AsNoTracking()
    .ToList();
Reference: AsNoTracking()
I want to insert some data into an SQLite table that has one column for string values and another column for a sequence number.
The SQLite documentation says that AUTOINCREMENT does not guarantee sequential values.
And I do not want to keep track of the previously inserted sequence number.
Is there any way to store data sequentially without keeping track of the previously inserted row?
The short answer is that you're right: the autoincrement documentation makes it clear that INTEGER PRIMARY KEY AUTOINCREMENT will be constantly increasing, though, as you point out, not necessarily sequentially. So you either have to modify your code so it's not contingent on sequential values (which is probably the right course of action), or you have to maintain your own sequential identifier yourself. I'm sure that's not the answer you're looking for, but I think it's the practical reality of the situation.
Short answer: Stop worrying about gaps in AUTOINCREMENT id sequences. They are inevitable when dealing with transactional databases.
Long answer:
SQLite cannot guarantee that AUTOINCREMENT will always increase by one, and the reason for this is transactions.
Say you have two database connections that start two parallel transactions almost at the same time. The first one acquires an AUTOINCREMENT id, which becomes the previously used value +1. One tick later, the second transaction acquires the next id, which is now +2. Now imagine the first transaction rolls back for some reason (it hits an error, the code decides to abort it, the program crashes, etc.). After that, the second transaction commits id +2, creating a gap in the id numbering.
Now, what if the number of such parallel transactions is higher than two? You cannot predict it, and you also cannot tell currently running transactions to reuse ids that ended up unused for whatever reason.
If you insert data sequentially into your SQLite database, it will be stored sequentially.
From the Documentation: the automatically generated ROWIDs are guaranteed to be monotonically increasing.
So, for example, if you wanted to have a table for Person, you could use the following command to create the table with AUTOINCREMENT.
CREATE TABLE person (personID INTEGER PRIMARY KEY AUTOINCREMENT, personName TEXT);
Link: http://www.sqlite.org/autoinc.html