Locking ID before doing an insert - Oracle 10G - oracle10g

I have a table whose primary keys are numbers that are not sequential.
Company policy is to register each new row with the lowest available ID. For example:
table.ID = [11,13,14,16,17]
lowest available ID = 12
I have an algorithm that gives me the lowest available ID. I want to know how to prevent this ID from being used by someone else before I do the insert.
Can this be done in the database, or would it have to be done in the programming language?
Thanks.

The company policy is extremely short-sighted, unless the company's goal is to build applications that do not scale and the company is unconcerned with performance.
If you really wanted to do this, you'd need to serialize all your transactions that touch this table, essentially turning your nice, powerful server into a single-threaded, single-user, low-end machine. There are any number of ways to do this. The simplest (though not simple) method would be to do a SELECT ... FOR UPDATE on the row with the largest key less than the new key you want to insert (11 in this case). Once you have acquired the lock, you would need to re-confirm that 12 is still vacant. If it is, you can insert the row with an id of 12. Otherwise, you'd need to restart the process: look for the new lowest key and try to lock the row with the largest key below it. When your transaction commits, the lock is released and the next session that was blocked waiting for the lock can proceed. This assumes that you control every process that inserts data into this table and that they all implement exactly the same logic. It will lock up the system if you ever allow transactions to span waits for human input, because humans will inevitably go to lunch with rows locked. And all that serialization will radically reduce the scalability of your application.
I would strongly encourage you to push back against the ridiculous "requirement" rather than implementing something this hideous.
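To make the cost of that serialization concrete, here is a minimal Python sketch, not Oracle code. An in-process `threading.Lock` stands in for the row lock taken by SELECT ... FOR UPDATE, and the class name, the method name and the `start` parameter are hypothetical names chosen for illustration.

```python
import threading


class LowestIdAllocator:
    """Hands out the lowest unused ID at or above `start`, one caller at a time."""

    def __init__(self, used_ids, start):
        self._used = set(used_ids)
        self._start = start
        # Stand-in for the database lock: only one caller may search and insert at once.
        self._lock = threading.Lock()

    def insert_with_lowest_id(self):
        with self._lock:
            candidate = self._start
            while candidate in self._used:
                candidate += 1
            # The "insert" happens while the lock is still held, so no other
            # caller can be handed the same ID.
            self._used.add(candidate)
            return candidate


allocator = LowestIdAllocator([11, 13, 14, 16, 17], start=11)
results = []


def worker():
    results.append(allocator.insert_with_lowest_id())


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every caller waits on the same lock, so the four workers run strictly one after another. That is the scalability price described above.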

Related

Simulating an Oracle sequence with MongoDB

Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do using an Oracle sequence.
This works in normal situations. However, when invoice creation fails because of a broken business rule (e.g. invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Since there is no transaction managing this unit of work, the increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not an option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it, unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Keeping a lock for the entire time the invoice is generated is usually not an option, though that would be easy. So let's try that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that haven't been used yet, but reserved.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider that number available again. It's hard enough to add that to the counter code, but it's possible (check if a timed out transaction is present, otherwise get a new counter value).
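The reserve, confirm and timeout bookkeeping described above can be sketched in a few lines of Python. This is an in-memory, single-process illustration only, not MongoDB code: `ReservingCounter`, its methods and the injectable `clock` are hypothetical names, and in a real system each step would have to be an atomic database operation.

```python
import time


class ReservingCounter:
    """In-memory sketch: numbers are reserved first, then confirmed or released."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._next = 1
        self._timeout = timeout_seconds
        self._clock = clock
        self._pending = {}  # number -> time at which it was reserved
        self._free = []     # released or timed-out numbers waiting for reuse

    def reserve(self):
        now = self._clock()
        # Treat reservations older than the timeout as abandoned.
        for number, reserved_at in list(self._pending.items()):
            if now - reserved_at > self._timeout:
                del self._pending[number]
                self._free.append(number)
        if self._free:
            number = min(self._free)  # fill the oldest gap first
            self._free.remove(number)
        else:
            number = self._next
            self._next += 1
        self._pending[number] = now
        return number

    def confirm(self, number):
        # Raises KeyError if the number is not currently reserved.
        del self._pending[number]

    def release(self, number):
        # Invoice creation failed: make the number available again right away.
        del self._pending[number]
        self._free.append(number)


fake_now = [0.0]  # controllable clock so the timeout can be shown without sleeping
counter = ReservingCounter(timeout_seconds=30, clock=lambda: fake_now[0])

a = counter.reserve()
counter.confirm(a)       # invoice stored successfully
b = counter.reserve()
counter.release(b)       # a business rule failed
c = counter.reserve()    # reuses the released number
d = counter.reserve()    # never confirmed, e.g. the worker crashed
counter.confirm(c)
fake_now[0] = 31.0
e = counter.reserve()    # picks up the timed-out number
```

The sketch deliberately leaves out one hard case: a slow worker whose reservation timed out and was handed to someone else could still call `confirm`. A real implementation would probably need an owner token per reservation to reject that.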
There are several possible errors, but they can all be resolved. This is better explained in the link and on the net. Generally, getting the implementation right is hard though.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap-free numbers within a limited time interval: if no further request arrives in the current year that could use the gap number, the gap stays.
I wonder whether findAndModify combined with the new Transactions API could achieve something like that, while also accounting for gaps, if it is run within a transaction. I haven't personally tried it, and my project isn't far enough along yet to worry about the billing system, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem I can think of is a write bottleneck, but I'd imagine the increment only takes a few milliseconds, and you could probably use a different counter for every jurisdiction or store, as real-life stores do. The cash register number could be part of it too; in the digital world, the cash register could be the transaction processing server the request went to (if you use microservices, for example), so you could load balance round robin between them. That assumes it uses a per-document lock, which it does as far as I understand.
The main time I'd worry about this bottleneck is for a very popular store, around Black Friday when there's a huge spike, or when issuing recurring invoices.

mongo save documents in monotonically increasing sequence

I know mongo docs provide a way to simulate auto_increment.
http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/
But it is not concurrency-proof as guaranteed by say MySQL.
Consider the sequence of events:
client 1 obtains an index of 1
client 2 obtains an index of 2
client 2 saves doc with id=2
client 1 saves doc with id=1
In this case, it is possible to save a doc with an id less than the current max that is already saved. For MySQL, this can never happen since the auto increment id is assigned by the server.
How do I prevent this? One way is to do optimistic looping at each client, but with many clients this will result in heavy contention. Is there a better way?
The use case for this is to ensure id is "forward-only". This is important for say a chat room where many messages are posted, and messages are paginated, I do not want new messages to be inserted in a previous page.
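The interleaving above, and why it matters for pagination, can be replayed deterministically in Python. This is a simulation only, with no database involved: `counter` stands in for the shared auto-increment counter, and `new_since` is a hypothetical helper modelling a reader that pages by "id greater than the last one seen".

```python
import itertools

counter = itertools.count(1)  # stand-in for the shared auto-increment counter
saved = set()                 # ids whose documents have actually been written


def new_since(last_seen):
    """What a reader paginating by 'id > last_seen' would fetch."""
    return sorted(i for i in saved if i > last_seen)


id_client_1 = next(counter)   # client 1 obtains an index of 1
id_client_2 = next(counter)   # client 2 obtains an index of 2
saved.add(id_client_2)        # client 2 saves first

first_poll = new_since(0)     # the reader sees only the doc with id 2
last_seen = max(first_poll)

saved.add(id_client_1)        # client 1 saves late
second_poll = new_since(last_seen)  # empty: the doc with id 1 is never delivered
```

In this model the late document lands "behind" the reader's cursor, which is the new-message-on-a-previous-page effect the question wants to avoid.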
"But it is not concurrency-proof as guaranteed by say MySQL."
That depends on the definition of concurrency-proof, but let's see.
"In this case, it is possible to save a doc with id less than the current max that is already saved."
That is correct, but it depends on the definition of simultaneity and monotonicity. Let's say your code snapshots the state of some other part of the system, then fetches the monotonic key, then performs an insert that may take a while. In that case, this apparently non-monotonic insert might actually be 'more monotonic' in the sense that index 2 was indeed captured at a later time, possibly reflecting a more recent state. In other words: does the time it took to insert really matter?
"For MySql, this can never happen since auto increment id is assigned by the server."
That sounds like folklore. Most relational dbs offer fine-grained control over these features, since strict guarantees severely impact concurrency.
MySQL guarantees neither that there are no gaps, nor that a transaction with a higher AUTO_INCREMENT id stays invisible to other readers until a transaction that acquired a lower AUTO_INCREMENT value has committed, unless you keep a table-level lock, which severely impacts concurrency.
For gaplessness, consider a transaction rollback of the first of two concurrent inserts. Does the second insert now get a new id assigned while it's being committed? No - from the InnoDB documentation:
"You may see gaps in the sequence of values assigned to the AUTO_INCREMENT column if you roll back transactions that have generated numbers using the counter." (see end of 14.6.5.5.1, "Traditional InnoDB Auto-Increment Locking")
and
"In all lock modes (0, 1, and 2), if a transaction that generated auto-increment values rolls back, those auto-increment values are “lost”"
Also, you're completely ignoring the problem of replication, where sequences lead to even more trouble:
"Thus, table-level locks held until the end of a statement make INSERT statements using auto-increment safe for use with statement-based replication. However, those locks limit concurrency and scalability when multiple transactions are executing insert statements at the same time." (see 14.6.5.5.2 "Configurable InnoDB Auto-Increment Locking")
The sheer length of the documentation of the InnoDB behavior is a reminder of the true complexity of making apparently simple guarantees in a concurrent system. Yes, monotonicity of inserts is possible with table-level locks, but hardly desirable. If you take a distributed view of the system, things get worse, because we can't even be sure of the counter value in partition mode...

anything wrong about having MANY sequences in postgres?

I am developing an application using a virtual private database pattern in postgres.
So every user gets an id, and all rows belonging to this user hold this id to keep them separate from other users' rows. This id should also be part of the primary key. In addition, every row has to have an id that is unique within the scope of the user. This id will be the other part of the primary key.
If we have to scale this across multiple servers, we can also append a third column to the pk identifying the shard this id was generated at.
My question now is how to create per-user unique ids. I came up with some options, but I am not sure about all their implications. The 2 solutions that seem most promising to me are:
creating one sequence per user:
This can be done automatically, using a trigger, every time a user is created. This is certainly transaction safe, and I think it should be quite OK in terms of performance.
What I am worried about is that this has to work for a lot of users (100k+), and I don't know how postgres will deal with 100k+ sequences. I tried to find out how sequences are implemented, but without luck.
counter in user table:
Keep all users in a table with a field holding the latest id given out for each user.
When a user starts a transaction, I can lock the row in the user table and create a temp sequence with the latest id from the user table as its starting value. This sequence can then be used to supply ids for new entries.
Before exiting the transaction, the current value has to be written back to the user table and the lock has to be released.
If another transaction from the same user tries to insert rows concurrently, it will stall until the first transaction releases its lock on the user table.
This way I do not need thousands of sequences, and I don't think there will be frequent concurrent accesses from one user (the application has an OLTP character, so there will not be long-lasting transactions). Even if it happens, it will just stall for about a second and not hurt anything.
The second part of my question is whether I should just use 2 columns (or maybe three if the shard_id joins the game) and make them a composite pk, or whether I should combine them into one column. I think handling will be much easier with separate columns, but what does performance look like? Let's assume both values are 32-bit integers: is it better to have 2 int columns in an index or 1 bigint column?
thx for all answers,
alex
I do not think sequences would be scalable to the level you want (100k sequences). A sequence is implemented as a relation with just one row in it.
Each sequence will appear in the system catalog (pg_class), which also contains all of the tables, views, etc. Having 100k rows there is sure to slow the system down dramatically. The amount of memory required to hold all of the data structures associated with these sequence relations would also be large.
Your second idea, combined with temporary sequences, might be more practical and more scalable.
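A stripped-down version of the counter-in-user-table idea might look like the following Python sketch. It uses the standard-library `sqlite3` module purely so that it runs anywhere; it is not PostgreSQL code, it leaves out the temporary sequence, and the table names, column names and `insert_item` helper are all hypothetical. In PostgreSQL the UPDATE on the user row would hold a row-level lock until commit, which gives the "second transaction stalls" behaviour described in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        last_row_id INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE items (
        user_id INTEGER NOT NULL REFERENCES users(user_id),
        row_id  INTEGER NOT NULL,
        payload TEXT,
        PRIMARY KEY (user_id, row_id)
    );
    INSERT INTO users (user_id) VALUES (1), (2);
""")


def insert_item(conn, user_id, payload):
    """Bump the user's counter and insert the row in a single transaction."""
    with conn:  # commits on success, rolls back on any exception
        cur = conn.execute(
            "UPDATE users SET last_row_id = last_row_id + 1 WHERE user_id = ?",
            (user_id,),
        )
        if cur.rowcount != 1:
            raise KeyError(f"unknown user {user_id}")
        (row_id,) = conn.execute(
            "SELECT last_row_id FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()
        conn.execute(
            "INSERT INTO items (user_id, row_id, payload) VALUES (?, ?, ?)",
            (user_id, row_id, payload),
        )
    return row_id


ids = [
    insert_item(conn, user, payload)
    for user, payload in [(1, "a"), (1, "b"), (2, "c"), (1, "d"), (2, "e")]
]
item_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Because the counter bump and the insert share one transaction, a failed insert rolls the counter back as well, so each user's ids stay dense.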
For your second question, I don't think a composite key would be any worse than a single column key, so I would go with whatever matches your functional needs.

Concurrent processes working on a PostgreSQL table

I have a simple procedure where I need to process records of a table, and ideally run multiple instances of the process without processing the same record. The way I've done this with MySQL is fairly common (although I perceive the token field to be more of a hack):
Adding a couple of fields to the table:
CREATE TABLE records (
    id INTEGER PRIMARY KEY AUTO_INCREMENT,
    ...actual fields...
    processed_at DATETIME DEFAULT NULL,
    process_token TEXT DEFAULT NULL
);
And then a simple processing script:
process_salt = md5(rand())  # or something like a process id

def get_record():
    # A fresh token per call marks the row this worker claims
    token = md5(microtime + process_salt)
    # Claim one unprocessed row that no other worker has claimed yet
    db.exec("UPDATE records SET process_token = ?
             WHERE processed_at IS NULL AND process_token IS NULL LIMIT 1", token)
    return db.exec("SELECT * FROM records WHERE process_token = ?", token)

while (row = get_record()) is valid:
    # ...do processing on row...
    db.exec("UPDATE records SET processed_at = NOW(), process_token = NULL
             WHERE id = ?", row.id)
I'm implementing such a process in a system which uses a PostgreSQL database. I know Pg could be considered more mature than MySQL with regards to locking thanks to MVCC - can I use row-locking or some other feature in Pg instead of the token field?
This approach will work with PostgreSQL, but it'll tend to be pretty inefficient because you're updating each row twice: each row requires two transactions and two commits. The cost of this can be mitigated somewhat by setting commit_delay and possibly disabling synchronous_commit, but it's still not going to be fast unless you have a non-volatile write-back cache on your storage subsystem.
More importantly, because you're committing the first update there is no way to tell the difference between a worker that's still working on the job and a worker that has crashed. You could probably set the token to the worker's process ID if all workers are on the local machine then scan for missing PIDs occasionally but that's cumbersome and race-condition prone, not to mention the problems with pid re-use.
I would recommend that you adopt a real queuing solution that is designed to solve these problems, like ActiveMQ, RabbitMQ, ZeroMQ, etc. PGQ may also be of significant interest.
Doing queue processing in a transactional relational database should be easy, but in practice it's ridiculously hard to do well and get right. Most of the "solutions" that look sensible at a glance turn out to actually serialize all work (so only one of many queue workers is doing anything at any given time) when examined in detail.
You can use SELECT ... FOR UPDATE NOWAIT which will obtain an exclusive lock on the row, or report an error if it is already locked.

What is better in terms of sqlite3 performance: delete unneeded row or set it as not needed?

I am writing an iPhone application where the user receives multiple messages from different users. These messages are stored in an sqlite3 database. With time the user might like to delete received messages from one user, but for sure he will continue to receive new messages from that user after deleting the old ones.
Since retrieving the messages will be done using a SELECT statement, which scenario is better to use when the user would like to delete the messages (in terms of performance):
1. DELETE all the old messages normally and continue to retrieve the new ones using a statement like: SELECT Messages FROM TableName WHERE UserID = (?)
2. Add a field of type INTEGER to the table, set this field to 1 upon the DELETE request, and after that retrieve the new messages using a statement like: SELECT Messages FROM TableName WHERE UserID = (?) AND IsDeleted = 0
One more thing, if scenario 1 is used (normal DELETE) will this cause any fragmentation of the database file on the disk?
Many thanks in advance.
Scenario 1 is much better: SELECT and DELETE operate at comparable speed, and scenario 1 guarantees that you have no dangling tuples (unwanted rows) left in your database.
If you need to be able to back up or recover data after a deletion, scenario 2 is a must, but you have to take into account the growing size of your database, which leads to slower performance in the future.
Finally, deleting rows should not cause serious fragmentation problems, since most databases include defragmentation and optimization tools in their engines (in SQLite, the VACUUM command).
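For SQLite specifically, the effect of a plain DELETE on the file can be observed with the standard-library `sqlite3` module. The sketch below is an illustration under assumptions: the table layout, row counts and file name are made up, and it runs on a desktop Python rather than on the iPhone. It deletes one user's messages, reads `PRAGMA freelist_count` (pages that are allocated in the file but currently unused), then runs VACUUM.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    conn = sqlite3.connect(os.path.join(tmp, "messages.db"))
    conn.execute("PRAGMA auto_vacuum = NONE")  # make the usual default explicit
    conn.execute(
        "CREATE TABLE messages (id INTEGER PRIMARY KEY, user_id INTEGER, body TEXT)"
    )
    # 1000 messages from user 1 followed by 1000 messages from user 2.
    rows = [(1 if i < 1000 else 2, "x" * 200) for i in range(2000)]
    conn.executemany("INSERT INTO messages (user_id, body) VALUES (?, ?)", rows)
    conn.commit()

    # Scenario 1: a normal DELETE of one user's old messages.
    conn.execute("DELETE FROM messages WHERE user_id = ?", (1,))
    conn.commit()
    free_after_delete = conn.execute("PRAGMA freelist_count").fetchone()[0]
    pages_after_delete = conn.execute("PRAGMA page_count").fetchone()[0]

    # VACUUM rebuilds the file; it must run outside a transaction.
    conn.execute("VACUUM")
    free_after_vacuum = conn.execute("PRAGMA freelist_count").fetchone()[0]
    pages_after_vacuum = conn.execute("PRAGMA page_count").fetchone()[0]
    conn.close()
```

After the DELETE the freed pages stay in the file and are reused by later inserts, so the file does not shrink by itself. Running VACUUM occasionally (or enabling auto_vacuum) reclaims the space.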
It would be a pretty lousy database if DELETE didn't work well. In absence of evidence to the contrary, I'd assume you are safe to delete as normal. The database's entire raison d'être is to make these sorts of operations efficient.
IMHO, if you don't use DELETE, the db will get bigger and bigger over time, making each SELECT less and less efficient.
Therefore I figure that deleting rows that will never be used again is more efficient.