This is not specific to any language; I just want to understand the term "lock contention" better.
Let's say we want to execute two kinds of operations using ARM instructions.
We have 10 stores, each selling products, and there is one central boss holding a ledger that records the total products sold.
If we are using exclusive access, my mental image is:
the 10 stores say "I want to update the sum", the boss gives the ledger to one of the stores, and that store adds the products it sold to the sum.
Then the boss gives the ledger to the next store. They all take turns, so the sum is not corrupted.
If we are using atomics, my mental image is:
The 10 stores just write a letter to the boss saying how many products they sold, and the boss does the sum himself, one letter at a time. No store is ever granted permission to touch the ledger.
The sum is just an AtomicAdd performed for each of the 10 stores.
So in this case, what is the lock to be contended with?
Atomicity is not quite the concept you are describing. Let's say one of the stores has a very large product count and a high sales volume. The store is in Gotham City and has lots of customers and employees. The regional boss yells 'STOP' to add up the ledgers. Unfortunately, the employee updating the Gotham City ledger was in the middle of changing 999,999 to 1,000,000, and the number reads as 1,000,999.
If the regional boss cannot yell 'STOP', then employees at other stores can be altering their counts while the inventory is being computed. If stores exchange inventory, this can lead to inaccuracy. Also, you will never know the actual sales at a given moment: between reading the first store's count and the last store's count, sales happen at all the stores, which throws off the sum.
Now, let's say there are regional bosses and a national boss. The national boss tries two ways to calculate all of the sales: she sums the counts from all the regional bosses, and she conducts a full count of all the stores. These numbers will not be equal unless she yells 'STOP' so that no sales are made while each regional boss conducts a regional count; she may conduct the total national count at the same time, since the underlying activity has stopped. Unfortunately, the business goes bankrupt due to the time this takes and the lost sales it creates.
The business reforms and they decide to count only one region at a time. Either a regional boss or a national boss may yell 'STOP' for that region. The lock contention happens when two bosses yell 'STOP' at the same time. Also, how do the sales restart? A boss has to yell 'GO'. If two bosses are doing a sales count at the same time, the stores will start reselling as soon as the first boss to finish yells 'GO'. This leads to inaccurate inventory counts (on top of only counting one local region at a time), and the business goes bankrupt because it does not have product to sell and customers get frustrated and go elsewhere.
The company reforms again, and they decide that the salespeople will update a regional count and a national count every time a sale is made. Having seen some problems before, the national manager thinks a long time about how the salespeople will update these counts. They buy a machine that adds one to the counts every time a button is pressed. She knows that if an employee had to read the count and write back a new value, the regional and national numbers could change while the new number was being calculated.
The 2nd case has lock contention issues. The lock is the yelling of 'STOP' and 'GO', and you have contention when multiple actors/processes want to take that lock at the same time. All of these cases involve atomicity: the ability to update information so that it is consistent when others read (or read and update) the value. In the last case, the machine gave the business a 'lock-free' primitive, so they can update the counts atomically without needing a lock.
Locks are needed when you have calculations that involve multiple atomic values or a mixture of reading and writing. If the value is not atomic to begin with, then even locks won't work.
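Since the question is language-agnostic, here is a minimal sketch in Python (not ARM assembly; the thread and counter names are made up) of the two situations: an unsynchronized read-modify-write that can lose updates, and a lock-protected update where the lock is exactly what the "stores" contend for.

```python
import threading

total = 0                       # the shared ledger
ledger_lock = threading.Lock()  # the boss yelling 'STOP'/'GO'

def sell(use_lock, sales=100_000):
    """One store recording its sales into the shared total."""
    global total
    for _ in range(sales):
        if use_lock:
            with ledger_lock:   # only one store may hold the ledger at a time
                total += 1
        else:
            total += 1          # unsynchronized read-modify-write

def run(use_lock):
    global total
    total = 0
    stores = [threading.Thread(target=sell, args=(use_lock,)) for _ in range(8)]
    for t in stores:
        t.start()
    for t in stores:
        t.join()
    return total

# Without the lock, the interleaved read/add/store steps may lose updates, so
# the result can come out below 800000 (depending on interpreter and timing).
# With the lock, the result is always 800000, but the eight threads now spend
# time waiting on each other -- that waiting is lock contention.
print("without lock:", run(False))
print("with lock:   ", run(True))
```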
Related
I'm trying to design a database. I have a User table which is supposed to have a balance field. The balance can be calculated each time I need it, based on 4 tables like below:
So I need to do some math: the sum of all deposits, minus the sum of withdrawals, minus bet amounts plus profits, plus amounts from bonuses. There can be thousands of records for each user in each related table.
Alternatively, I can just update the balance field from the application code whenever one of the related tables is altered.
However, the first method tends to get slower and slower as the database grows. The second method is prone to errors: if my application fails to update the field, the real balance will get out of sync with the balance field.
I wonder if there is any design pattern or technique to handle these cases? How do online banking or similar services compute balances? Do they go through every bank transaction each time a balance is requested?
For the first method, in order to avoid loading all transactions to calculate the account balance, we can take a snapshot of the account balance periodically (e.g. at the end of the day/month, or after a certain number of completed transactions). When calculating the latest account balance, we then only need to load the latest snapshot balance and the transactions after that snapshot time, rather than loading all transaction records.
You can also find a similar snapshot pattern in event sourcing.
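To make the snapshot idea concrete, here is a rough sketch using SQLite from Python; the snapshots and transactions table names (and the sign convention for withdrawals/bets) are assumptions, not part of the question's schema.

```python
import sqlite3

# Assumed schema: snapshots(user_id, balance, taken_at) and
# transactions(user_id, amount, created_at), where deposits/bonuses are
# positive amounts and withdrawals/bets are negative.

def current_balance(conn: sqlite3.Connection, user_id: int) -> float:
    row = conn.execute(
        "SELECT balance, taken_at FROM snapshots "
        "WHERE user_id = ? ORDER BY taken_at DESC LIMIT 1",
        (user_id,),
    ).fetchone()
    snapshot_balance, taken_at = row if row else (0.0, "1970-01-01")

    # Only the transactions made after the snapshot need to be summed.
    delta = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions "
        "WHERE user_id = ? AND created_at > ?",
        (user_id, taken_at),
    ).fetchone()[0]
    return snapshot_balance + delta

def take_snapshot(conn: sqlite3.Connection, user_id: int) -> None:
    # Run periodically (end of day, or every N transactions).
    conn.execute(
        "INSERT INTO snapshots (user_id, balance, taken_at) "
        "VALUES (?, ?, datetime('now'))",
        (user_id, current_balance(conn, user_id)),
    )
    conn.commit()
```

The periodic snapshot keeps the per-request query bounded, so the first method no longer degrades as the transaction history grows.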
Situation
I am trying to implement a warehouse system using a traditional database.
The tables are:
products (each row represents one SKU)
warehouse_locations (each row represents a particular shelf in a particular warehouse)
pallets (each row represents a particular pallet)
user_defined_categories (each row represents a particular user-defined category, e.g. reserved, available, total_physical, etc.)
products_in_pallets_by_categories (each row has foreign keys to the pallets, products, and user_defined_categories tables, and specifies the quantity of products of a particular category in a particular pallet)
products_in_warehouse_locations_by_categories (each row has foreign keys to the warehouse_locations, products, and user_defined_categories tables, and specifies the quantity of products of a particular category in a particular warehouse location)
What end users want to see/do
End users will update the system about which products are placed on or removed from which pallet.
End users will also want to know any time (preferably in real-time) how many reserved or available products are in the warehouse.
So what's my initial plan?
I wanted to use a traditional RDBMS like PostgreSQL and a message queue like RabbitMQ to provide real-time updates. By real-time updates, I mean that end users using either a single-page application or a mobile phone can observe changes in inventory in real time.
So what's changed?
I came across the RethinkDB FAQ, and it said:
RethinkDB is not a good choice if you need full ACID support or strong
schema enforcement—in this case you are better off using a relational
database such as MySQL or PostgreSQL.
Why were you even considering RethinkDB?
Because if I can use it and it allows real-time updates, it will help tremendously, as we expect the client's sales team to place reservations on our system from around the world.
What's the most frequent updates/inserts?
The movement of products from one place to another. I expect plenty of updates/inserts/deletes to the relation tables. I apologise, I do not know how to explain this in the RethinkDB paradigm; I am a traditional RDBMS person.
Is the system built yet?
Not yet. Which is why I want to seek an answer regarding rethinkdb before actually proceeding.
Do you expect to use any transactions?
Well, I am not sure.
I can think of a real-world case where a warehouse worker moves products (partially or completely) from one pallet to another pallet.
Another real-world case is where a warehouse worker moves products from a pallet to a warehouse_location (or vice versa).
Do I definitely need to use transactions? Again, I am not sure.
Because I expect the workers to update the system AFTER they have physically finished the move.
I will provide a screen for them to choose
move from <some dropdown> to <another dropdown>
So what's the question?
Do I need full ACID support or strong schema enforcement for my warehouse system, based on my user requirements at the moment? And is it implementable using RethinkDB?
I also expect to implement activity streams once the system is implemented which will show events such as Worker A moved 100 units of product A from warehouse shelf 1A to pallet 25.
When you are dealing with things where information must always be accurate and consistent, ACID matters. From what you say, it sounds like it is important.
It sounds to me like you want to allow real-time updates, and the key problem is that you see RabbitMQ as the non-real-time component, correct? Why were you even considering RabbitMQ? (If it is to allow the DB to go down for maintenance, maybe implement a backup private caching store in SQLite?)
In general, you should assume you need ACID compliance until you have a model where eventual consistency is OK. Additionally, real-time accurate reporting rules out eventual consistency.
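To make that concrete: the pallet-to-pallet move described in the question touches two rows that must change together, which is exactly what a transaction buys you. A rough sketch with psycopg2 against PostgreSQL (table and column names are guesses based on the schema above):

```python
import psycopg2

def move_products(conn, product_id, category_id,
                  from_pallet_id, to_pallet_id, qty):
    # Both updates succeed or neither does; column names are illustrative.
    with conn:                      # commits on success, rolls back on exception
        with conn.cursor() as cur:
            cur.execute(
                """UPDATE products_in_pallets_by_categories
                   SET quantity = quantity - %s
                   WHERE pallet_id = %s AND product_id = %s
                     AND user_defined_category_id = %s AND quantity >= %s""",
                (qty, from_pallet_id, product_id, category_id, qty))
            if cur.rowcount != 1:
                raise ValueError("not enough stock on the source pallet")
            cur.execute(
                """UPDATE products_in_pallets_by_categories
                   SET quantity = quantity + %s
                   WHERE pallet_id = %s AND product_id = %s
                     AND user_defined_category_id = %s""",
                (qty, to_pallet_id, product_id, category_id))
```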
Our domain model deals with sales invoices, each of which has a unique, automatically generated number. When creating an invoice, our SalesInvoiceService retrieves a number from a SalesInvoiceNumberGenerator, creates a SalesInvoice using this number and a few other objects (seller, buyer, issue date, etc.) and stores it through the SalesInvoiceRepository. Since we are using MongoDB as our database, our MongoDbSalesInvoiceNumberGenerator uses a findAndModify command with $inc 1 on a given InvoicePolicies.nextSalesInvoiceNumber to generate this unique number, similar to what we would do with an Oracle sequence.
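The generator described above corresponds roughly to this PyMongo sketch (the real code is not Python, and the counter document's _id is illustrative):

```python
from pymongo import MongoClient, ReturnDocument

db = MongoClient().billing  # hypothetical database name

def next_sales_invoice_number() -> int:
    # Atomically increments the counter and returns the new value,
    # much like NEXTVAL on an Oracle sequence.
    doc = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoice"},                   # illustrative counter id
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    return doc["nextSalesInvoiceNumber"]
```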
This is working in normal situations. However, when invoice creation fails because of a broken business rule (e.g. invalid issue date), an exception is thrown and our InvoicePolicies.nextSalesInvoiceNumber has already been incremented. Obviously, since there is no transaction managing this unit of work, this increment is not rolled back, so we end up with lost invoice numbers. We do offer a manual compensation mechanism to the user, but we would like to avoid this sort of situation in the first place.
How would you deal with this situation? And no, switching to another database is not an option :)
Thanks!
TL;DR: What you want is strict serializability, but you probably won't get it unless you give up concurrency completely (then you even get linearizability, theoretically). Gap-free is easy, but making sure that today's invoice doesn't get a lower number than yesterday's is practically impossible.
This is tricky, or at least, very expensive. That is also true for any other data store, because you'll have to limit the concurrency of the application to guarantee it. Think of an auto-increasing stamp that is passed around in an office, but some office workers lose letters. Tricky... But you can reduce the likelihood.
Generating sequences without gaps is hard when contention is high, and very hard in a distributed system. Keeping a lock for the entire time the invoice is generated is usually not an option, though that would be easy. So let's try that:
Easiest way out: Use a singleton background worker, i.e. a single-threaded process that runs on a single machine. Have it explicitly check whether the current number is really present in the invoice collection. Because it's single-threaded on a single machine, it can't have race conditions. Done, via limiting concurrency.
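In PyMongo terms, that single worker's number assignment might look roughly like this (the invoices collection and field names are invented; it is only safe because exactly one single-threaded process runs it):

```python
from pymongo import MongoClient

db = MongoClient().billing            # hypothetical database

def next_gap_free_number() -> int:
    # Only safe because exactly one single-threaded worker ever calls this.
    last = db.invoices.find_one(sort=[("number", -1)])
    candidate = (last["number"] + 1) if last else 1
    # Explicitly check that the candidate really is unused before handing it out.
    assert db.invoices.find_one({"number": candidate}) is None
    return candidate
```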
When allowing concurrency, things get messy:
It might be best to use something like a two-phase commit protocol. Essentially, make the entire invoice creation process a long-running transaction, and store the pending transactions explicitly, i.e. store all numbers that have been reserved but not yet used.
Then track the completion status of each and every transaction. If a transaction hasn't finished after some timeout, consider that number available again. It's hard enough to add that to the counter code, but it's possible (check if a timed out transaction is present, otherwise get a new counter value).
There are several possible errors, but they can all be resolved. This is better explained in the link and on the net. Generally, getting the implementation right is hard though.
The timeout poses a problem, however, because you need to hard-code an assumption about the time it takes for invoices to be generated. That can be awkward close to day/month/year barriers, since you'll want to avoid creating invoice 12345 in 2015 and 12344 in 2014.
Even this won't guarantee gap free numbers for limited time intervals: if no more request is made that could use the gap number in the current year, you're facing a problem.
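With those caveats in mind, a very rough sketch of the reserve-and-timeout bookkeeping in PyMongo terms (collection and field names are invented; the surrounding two-phase logic and error handling are omitted):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient, ReturnDocument

db = MongoClient().billing          # hypothetical database
TIMEOUT = timedelta(minutes=5)      # assumed upper bound on invoice creation

def reserve_invoice_number() -> int:
    now = datetime.utcnow()
    # First try to recycle a reservation that timed out without completing.
    stale = db.invoice_reservations.find_one_and_update(
        {"status": "pending", "reserved_at": {"$lt": now - TIMEOUT}},
        {"$set": {"reserved_at": now}},
        return_document=ReturnDocument.AFTER,
    )
    if stale:
        return stale["number"]

    # Otherwise draw a fresh number from the counter and record the reservation.
    counter = db.InvoicePolicies.find_one_and_update(
        {"_id": "salesInvoice"},
        {"$inc": {"nextSalesInvoiceNumber": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    number = counter["nextSalesInvoiceNumber"]
    db.invoice_reservations.insert_one(
        {"number": number, "status": "pending", "reserved_at": now})
    return number

def mark_invoice_created(number: int) -> None:
    db.invoice_reservations.update_one(
        {"number": number}, {"$set": {"status": "done"}})
```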
I wonder if something like findAndModify combined with the new Transactions API could be used to achieve this, while also accounting for gaps, if run within a transaction? I haven't personally tried it, and my project isn't far enough along yet to worry about the billing system, but I would love to be able to use the same database for everything to make things a bit easier to operate.
One problem, I would think, is a possible write bottleneck, but this should only take a few milliseconds, I'd imagine, and you could probably use a different counter for every jurisdiction or store, like real-life stores do. The cash register number could be part of it too; I guess cash register numbers in the digital world could be the transaction processing server the request went to (say, if you used microservices), so you could probably load balance round-robin between them. That's assuming it uses a per-document lock, which from my understanding it possibly does.
The only time I'd really worry about this bottleneck is if you had a very popular store, or around Black Friday when there's a huge spike, or when generating recurring invoices.
I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RUN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so even if only a few users have actually changed rank, we still need to update the rest of the users' ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush and slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the i/o) help, as data loss isn't something I'm worried about as the database is updated frequently regardless?
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis. Redis has something called a "Sorted Set" designed just for this job; with it you can have 100 million entries and still fetch the 5,000,000th to 5,001,000th at sub-millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
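For the ladder use case, that boils down to something like the following redis-py sketch (the key and member names are made up):

```python
import redis

r = redis.Redis()
LADDER = "ladder:na"                      # illustrative key, one per region

# Submitting / updating scores -- ZADD overwrites the score of existing members.
r.zadd(LADDER, {"player:alice": 2450, "player:bob": 2390, "player:carol": 2450})

# Top 10, highest score first, with scores.
top10 = r.zrevrange(LADDER, 0, 9, withscores=True)

# A single player's rank (0-based, highest score = rank 0) and score.
rank = r.zrevrank(LADDER, "player:bob")
score = r.zscore(LADDER, "player:bob")

# Players ranked around bob, e.g. five above and five below.
window = r.zrevrange(LADDER, max(rank - 5, 0), rank + 5, withscores=True)
```

One caveat for this particular question: ZREVRANK breaks ties lexicographically by member, so two players with the same score still get distinct ranks. If tied players must share a rank, you can derive it by counting members with a strictly higher score (e.g. with ZCOUNT) instead.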
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can set up a replica set and send your read queries there. This should help you a fair bit, but not as much as the disks.
1) My first question is regarding the best solution for storing statistics with MongoDB.
If I want to store large amounts of statistics (let's say visitors to a specific site, down to hourly resolution), a NoSQL DB like MongoDB seems to work very well. But how do I structure those collections to get the most out of MongoDB?
I'd increase the visitor count for that specific object id (for example SITE_MONTH_DAY_YEAR_SOMEOTHERFANCYPARAMETER) by one every time a user visits the page. But if the database gets big (>10 GB), doesn't that slow down (like it would on MySQL) because it has to search for the object_id and update it? Is the data always accurate when I update it (AFAIK MongoDB does not have any table locking)?
Wouldn't it be faster (and more accurate) to just INSERT one row for every visitor? On the other hand, reading the statistics would be much faster with my first solution, wouldn't it (especially in terms of grouping by site/date/...)?
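For what it's worth, the pre-aggregated variant of the first option usually looks something like this PyMongo sketch (the document key layout is just one possibility): the upsert plus $inc means there is no read-modify-write in the application, and the lookup goes through the _id index, so it doesn't degrade the way a full scan would as the collection grows.

```python
from datetime import datetime
from pymongo import MongoClient

stats = MongoClient().analytics.statistics   # hypothetical db/collection

def count_visit(site_id, when=None):
    when = when or datetime.utcnow()
    # One document per site per hour; $inc is atomic per document, and the
    # upsert creates the bucket on the first visit of that hour.
    stats.update_one(
        {"_id": f"{site_id}:{when:%Y-%m-%d:%H}"},
        {"$inc": {"visitors": 1}},
        upsert=True,
    )
```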
2) For every visitor counted I'd like to make a money transfer between two users. It is crucial that those transfers are always accurate. How would you achieve that?
I was thinking about an hourly cron job that picks the number of visitors from the mongoDB.statistics collection for the last hour and updates the users' balances. I'd prefer doing this directly/live while counting the visitor - but what happens if thousands of visitors are calling the script simultaneously? Is there any risk of getting wrong balances?
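Regarding the risk of wrong balances: one way to avoid the read-modify-write race if you credit users live is to let MongoDB apply a relative, per-document atomic increment rather than computing the new balance in the script. A hedged sketch (collection and field names are invented):

```python
from pymongo import MongoClient

users = MongoClient().analytics.users     # hypothetical collection

def credit_per_visit(publisher_id, advertiser_id, amount):
    # Each call applies a relative change; concurrent calls cannot clobber each
    # other the way "read balance, add, write back" can. Note the two updates
    # are not one transaction, so a crash in between still needs reconciliation
    # (or a multi-document transaction on MongoDB 4.0+ replica sets).
    users.update_one({"_id": publisher_id}, {"$inc": {"balance": amount}})
    users.update_one({"_id": advertiser_id}, {"$inc": {"balance": -amount}})
```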