How do I model a queue on top of a key-value store efficiently?

Suppose I have a key-value database, and I need to build a queue on top of it. How could I achieve this without getting bad performance?
One idea might be to store the queue inside an array, and simply store the array under a fixed key. This is quite a simple implementation, but it is very slow, because the complete array must be loaded or saved for every read or write access.
I could also implement a linked list with random keys, plus one fixed key that acts as the starting point and refers to the first element. Depending on whether I prefer fast reads or fast writes, I could have the fixed element point to the first or the last entry in the queue (so I have to traverse it forward or backward).
Or, to take that further, I could also have two fixed pointers: one for the first item, one for the last.
Any other suggestions on how to do this effectively?

Fundamentally, a key-value structure is very similar to ordinary memory, where the physical address plays the role of the key. So any data structure, including a linked list, can be modeled on top of key-value storage.
A linked list is a list of nodes, each holding a reference to the previous or the following node. The node itself can also be viewed as a small key-value structure. By adding a prefix to the key, the fields of each node can be stored separately in a flat table of key-value pairs.
Going further, a special suffix on the key can make the pointer information redundant. Such a list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is easy to imagine, I think. If you can run a daemon thread that maintains these keys, pilot-5 could be renamed to pilot-4 in the example above. Even if an additional thread is not allowed in your situation, the queue itself still works; there is just some overhead at each gap in the sequence.
Which of the two approaches above to use is a trade-off between storage space and CPU time.
Thread safety is a real problem, but an old one. Classes implementing the ConcurrentMap interface in the JDK provide atomic operations on key-value data. Some key-value middleware, such as memcached, offers similar methods that let you update an individual key or value in a thread-safe way. However, this is a matter of the algorithm rather than of the key-value structure itself.
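The suffix-indexed layout above can be sketched roughly as follows. This is only an illustration: a plain Python dict stands in for the key-value store, and the helper names (iterate, append, compact) are made up for this sketch. compact does in one pass what the daemon thread described above would do in the background.

```python
# Hypothetical in-memory stand-in for the key-value store.
store = {
    "pilot-last-index": 5,
    "pilot-0": "Rei Ayanami",
    "pilot-1": "Shinji Ikari",
    "pilot-2": "Soryu Asuka Langley",
    "pilot-3": "Touji Suzuhara",
    "pilot-5": "Makinami Mari",
}

def iterate(store, prefix="pilot"):
    """Yield values in index order, skipping gaps left by removed entries."""
    last = store[f"{prefix}-last-index"]
    for i in range(last + 1):
        key = f"{prefix}-{i}"
        if key in store:
            yield store[key]

def append(store, value, prefix="pilot"):
    """Store a new value under the next free index."""
    last = store[f"{prefix}-last-index"] + 1
    store[f"{prefix}-{last}"] = value
    store[f"{prefix}-last-index"] = last

def compact(store, prefix="pilot"):
    """Close gaps by renaming keys so the indexes become contiguous again."""
    last = store[f"{prefix}-last-index"]
    write = 0
    for read in range(last + 1):
        key = f"{prefix}-{read}"
        if key in store:
            value = store.pop(key)
            store[f"{prefix}-{write}"] = value
            write += 1
    store[f"{prefix}-last-index"] = write - 1
```

Without compaction, every gap costs one wasted lookup during iteration; with it, the rename work costs CPU time instead. That is the trade-off mentioned above.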

I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will always be some kind of hack involved.
For a simple first-in, first-out queue you could use a few key-value pairs like the following:
{
oldestIndex:5,
newestIndex:10
}
In this example there would be 6 items in the queue (5, 6, 7, 8, 9, 10). Items 0 to 4 are already done, and there is no item 11 yet. The producer would increment newestIndex and save its item under the key 11. The consumer takes the item under the key 5 and increments oldestIndex.
Note that this approach can lead to problems if you have multiple consumers/producers, or if the queue is never empty so that you can't reset the indexes.
But the multithreading problem also applies to linked lists etc.
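A minimal single-threaded sketch of the two-index scheme, assuming a plain Python dict as the store. The KVQueue class and the "item-" key prefix are invented for the illustration (the description above stores items directly under the numeric key), and nothing here is safe with multiple producers or consumers.

```python
class KVQueue:
    """FIFO queue kept in a dict-like key-value store via two index keys."""

    def __init__(self, store):
        self.store = store
        # An empty queue is represented by oldestIndex > newestIndex.
        self.store.setdefault("oldestIndex", 0)
        self.store.setdefault("newestIndex", -1)

    def enqueue(self, item):
        new = self.store["newestIndex"] + 1
        self.store[f"item-{new}"] = item   # write the item first
        self.store["newestIndex"] = new    # then publish the new index

    def dequeue(self):
        oldest = self.store["oldestIndex"]
        if oldest > self.store["newestIndex"]:
            raise IndexError("queue is empty")
        item = self.store.pop(f"item-{oldest}")
        self.store["oldestIndex"] = oldest + 1
        return item

    def __len__(self):
        return self.store["newestIndex"] - self.store["oldestIndex"] + 1
```

Each operation touches a constant number of keys, regardless of how many items are queued.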

Related

How to model very large work queues in Akka?

I am writing a scala script to download all items from the hacker news API. There are ~12M items, each being a JSON of ~200 bytes.
I identified the following issues:
Storing the data: I tried to save each item as a single JSON file, but it became very hard just to list them (using Linux, ext4 file system). So I changed it to just append JSON items to multiple (100) files (by taking the item's id modulo 100).
Keeping track of what has been downloaded, because I want to be able to stop/continue the application. First I tried writing the downloaded ids to a textfile, but it turned out a little bit buggy. So now I just read all the items and collect the ids. (It works.)
All this is done with 1 Master actor and an arbitrary number of Worker actors (tens). The Master has a Queue[Int] and pops it and Workers ask for work.
The problem I am having is fairly simple but I haven't been able to solve it in a nice way.
I can collect the ids from items already downloaded in a list. But what I really need is the complement to that set; I need all the items I have not downloaded, up to the highest item id.
I tried using a range (1 to maxItemId) and subtracting the set of done jobs but it is really slow. reaaaaaaally slow.
Now I am using a Stream, and when a worker asks for a job, I check whether the stream's head (the next job) has already been done. If not, I give it to the Worker. Otherwise I check the next one.
The problem with this approach is that I can not put jobs back at the stream if they fail. That would be easy with the Queue; but then again I am having trouble just setting up the queue with millions of items.
What could be a better approach to this? I don't think the issues here are trivial, as this is a very large number of tasks to perform and keep track of, but it shouldn't be so hard either.
Thanks!
As far as I understood your question, I think you don't need a very complicated data structure here.
Assuming your ids are sequential from 1 to maxItemId, you can use an array of Boolean with maxItemId size to keep track of processed items. You initialize this array by reading the processed ids. And you find the next job by searching for the next false entry.
Assuming that your maxItemId is around 12M, iterating over all items is pretty much instantaneous.
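As a rough sketch of that idea (shown in Python rather than Scala; the function names are made up for the illustration), a bytearray with one flag per id costs about 12 MB for 12M items:

```python
def build_done_flags(done_ids, max_item_id):
    """Return one flag per id; index 0 is unused because ids start at 1."""
    flags = bytearray(max_item_id + 1)
    for item_id in done_ids:
        flags[item_id] = 1
    return flags

def pending_ids(flags):
    """Lazily yield every id from 1 to maxItemId that is not marked done."""
    for item_id in range(1, len(flags)):
        if not flags[item_id]:
            yield item_id
```

A job that fails simply keeps its flag at 0, so a later pass over the flags picks it up again. Nothing has to be put back onto a stream.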

Detecting concurrent data modification of document between read and write

I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
...
items_average: 1234,
last_10_items: [10,2187,2133, ...]
...
}
Suppose a new item (X) comes in, five things will need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This certainly is something easy to implement in a single-threaded system, but in a concurrent system, if 2 threads would like to follow the above steps, inconsistencies might occur since both will update the last_10_items and items_average values without considering and/or overwriting the concurrent changes.
So, my question is how can such a scenario be handled? Is there a way to check or react-upon the fact that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from redis or 'Concurrent Modification Error' from relational DBs?
Thanks
Database systems use a memory inspection and rollback scheme that is similar to transactional memory.
Briefly, it monitors the shared memory regions you specify and applies a primitive such as compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any of that memory changes during the transaction, the transaction aborts and retries until no conflicting operation touches the shared memory.
For example, GCC provides the following atomic builtins:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory,
http://en.wikipedia.org/wiki/Software_transactional_memory
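Applied to the document example from the question, the same compare-and-swap idea might look like the sketch below. It is a toy, not any particular database's API: VersionedStore is a hypothetical in-memory store, and put_if_version stands in for whatever conditional write the real database offers. The version number is an assumption added for the illustration.

```python
import copy
import threading

class VersionedStore:
    """Hypothetical in-memory document store with a conditional write."""

    def __init__(self):
        self._docs = {}                 # key -> (version, document)
        self._lock = threading.Lock()   # makes each single store call atomic

    def get(self, key):
        with self._lock:
            version, doc = self._docs[key]
            return version, copy.deepcopy(doc)

    def put_if_version(self, key, expected_version, doc):
        """Write only if nobody changed the document since it was read."""
        with self._lock:
            current = self._docs.get(key, (0, None))[0]
            if current != expected_version:
                return False
            self._docs[key] = (current + 1, copy.deepcopy(doc))
            return True

def add_item(store, key, x, max_retries=100):
    """Steps 1 to 5 from the question, retried whenever a conflict is detected."""
    for _ in range(max_retries):
        version, doc = store.get(key)                     # 1. read
        items = doc["last_10_items"][1:] + [x]            # 2. drop oldest, 3. add X
        doc["last_10_items"] = items
        doc["items_average"] = sum(items) / len(items)    # 4. recompute
        if store.put_if_version(key, version, doc):       # 5. conditional write
            return doc
    raise RuntimeError("too many concurrent modifications")
```

A writer that loses the race gets False back from the conditional write and reruns the whole computation on fresh data, so no update is silently overwritten.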

How can I get the index of an item in an IOrderedQueryable?

Background:
I'm designing a list-like control (WinForms) that's backed by a DbSet. A chief requirement is that it doesn't load the entire list into local memory. I'm using a DataGridView in virtual mode as the underlying UI. I'm planning to implement the CellValueNeeded function as orderedQueryable.ElementAt(n).
Problem:
I need to allow the control's consumer to get/set the currently-selected value, by value rather than by index. Getting is easy--it's the same as the CellValueNeeded operation--but setting is harder: it requires me to get the index of a given element. There's not a built-in orderedQueryable.FirstIndexOf(value) operation, and although I could theoretically fake it with some sort of orderedQueryable.SkipWhile shenanigans where the expression has a side-effect, in practice the DbSet's query provider probably doesn't support doing that.
Questions:
Is there an efficient way to get the index of a particular value within an IOrderedQueryable? How?
(If this approach turns out to be untenable, I'd settle for suggestions on how I might restructure the problem to make it solvable.)
Side notes:
Elements can be inserted and removed from the list, in which case the old indices will be invalid--that's acceptable, since they're never exposed to the consumer. It's an error for the consumer to attempt to select an item that isn't actually in the list, and actually the consumer would have gotten the item from the list in the first place (although perhaps the indices have changed since then).

How can I implement incr/decr on top of a key/value store?

How can I implement incr/decr on top of a key/value store?
I'm using a key-value store that doesn't support incr and decr, which is why I want to create this. I have used the Redis and Memcached incr and decr, and, as mentioned in some of the answers, they are a perfect example of how I want incr and decr to behave, so thanks to those who mentioned this.
The point of having a incr() function is it's all internal to the store. You don't have to pull data out and push it back in.
What you're doing sounds like you want to put some logic in your code that pulls the data out, increments it and pushes it back in... While it's not very hard (I think I've just described how you'd do it), it does defeat the point somewhat.
To get the benefit you'd need to change the source of your key store. Might be easy.
But a lot of caches already have this. If you really need this for speed, perhaps you should find an alternate store like memcached that does support it.
Memcache has this functionality built in.
edit: it looks like you're not going to get an atomic update without updating the source, as there doesn't appear to be a lock function. If there is (and this is not pretty), you can lock the value, get it, increment it in your application, put it, and unlock it. Suboptimal though.
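The lock, get, increment, put, unlock sequence could be sketched like this. It assumes all clients live in one process, so a threading.Lock can play the role of the store's lock function. The dict named store and the incr/decr helpers are stand-ins for the illustration. With several processes or machines, a lock provided by the store itself would be needed instead.

```python
import threading

store = {}                         # hypothetical store offering only get/put
_counter_lock = threading.Lock()   # stands in for the store's lock function

def incr(key, delta=1):
    """Lock, read, add, write back, unlock; returns the new value."""
    with _counter_lock:
        value = store.get(key, 0) + delta
        store[key] = value
        return value

def decr(key, delta=1):
    return incr(key, -delta)
```

A single global lock serializes all counters; one lock per key would reduce contention at the cost of more bookkeeping.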
It kind of seems like without a compareAndSet you are out of luck. But it will help to consider the problem from another angle. For example, if you were implementing an atomic counter that shows the number of upvotes for a question, then one way would be to have a "table" per question and to put a +1 for each upvote and a -1 for each downvote. Then to "get" you would sum the "table". For this to work I assume "tables" are inexpensive and you don't care how long "get" takes to compute; you only mentioned incr/decr.
If you wish to atomically increment or decrement an int value associated with a key of e.g. type string, and if you'll know all of the keys in advance of having to perform the atomic operations on any of them, use Dictionary<string, int[]> and pre-populate the dictionary with a single-item array for each key value. It will then be possible to perform atomic operations (e.g. increment) on items via code like Threading.Interlocked.Increment(ref MyDict[keyString][0]);. If you need to be able to deal with keys that are not known in advance, you may need to use a ConcurrentDictionary instead of Dictionary, but you need to be careful if two threads try to simultaneously create dictionary entries for the same key.
Since increment and decrement are simple addition and subtraction operations that are "commutative", what you need to implement is a PN-Counter. It is a CRDT (commutative replicated data type). Various examples of how to implement this on Riak are available around the web and on Github.
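For illustration, a state-based PN-Counter can be sketched in a few lines. This is a generic textbook-style version, not the Riak implementation. The class and method names are chosen for this sketch, and each replica is assumed to have a unique id.

```python
class PNCounter:
    """State-based PN-Counter: per-replica increment and decrement totals."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.p = {}   # replica id -> total increments made by that replica
        self.n = {}   # replica id -> total decrements made by that replica

    def incr(self, amount=1):
        self.p[self.replica_id] = self.p.get(self.replica_id, 0) + amount

    def decr(self, amount=1):
        self.n[self.replica_id] = self.n.get(self.replica_id, 0) + amount

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        """Take the per-replica maximum; safe to apply repeatedly and in any order."""
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for rid, count in theirs.items():
                mine[rid] = max(mine.get(rid, 0), count)
```

Each replica only ever writes its own slot, so no read-modify-write on shared data is needed. The state of each replica could be stored as an ordinary value under its own key and merged on read.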

What is the most practical Solution to Data Management using SQLite on the iPhone?

I'm developing an iPhone application and am new to Objective-C as well as SQLite. That being said, I have been struggling w/ designing a practical data management solution that is worthy of existing. Any help would be greatly appreciated.
Here's the deal:
The majority of the data my application interacts with is stored in five tables in the local SQLite database. Each table has a corresponding Class which handles initialization, hydration, dehydration, deletion, etc. for each object/row in the corresponding table. Whenever the application loads, it populates five NSMutableArrays (one for each type of object). In addition to a Primary Key, each object instance always has an ID attribute available, regardless of hydration state. In most cases it is a UUID which I can then easily reference.
Until a few days ago, I would simply access the objects via these arrays by tracking down their UUID. I would then proceed to hydrate/dehydrate them as I needed. However, some of the objects I have also maintain their own arrays which reference other objects' UUIDs. In the event that I must track down one of these "child" objects via its UUID, it becomes a bit more difficult.
In order to avoid having to enumerate through one of the previously mentioned arrays to find a "parent" object's UUID, and then proceed to find the "child's" UUID, I added a DataController w/ a singleton instance to simplify the process.
I had hoped that the DataController could provide a single access point to the local database and make things easier, but I'm not so certain that is the case. Basically, what I did is create multiple NSMutableDictionaries. Whenever the DataController is initialized, it enumerates through each of the previously mentioned NSMutableArrays maintained in the Application Delegate and creates a key/value pair in the corresponding dictionary, using the given object as the value and its UUID as the key.
The DataController then exposes procedures that allow a client to call in w/ a desired object's UUID to retrieve a reference to the actual object. Whenever there is a request for an object, the DataController automatically hydrates the object in question and then returns it. I did this because I wanted to take control of hydration out of the client's hands to prevent dehydrating an object being referenced multiple times.
I realize that in most cases I could just make a mutable copy of the object and then if necessary replace the original object down the road, but I wanted to avoid that scenario if at all possible. I therefore added an additional dictionary to monitor what objects are hydrated at any given time using the object's UUID as the key and a fluctuating count representing the number of hydrations w/out an offsetting dehydration. My goal w/ this approach was to have the DataController automatically dehydrate any object once its "hydration retainment count" hit zero, but this could easily lead to significant memory leaks as it currently relies on the caller to later call a procedure that decreases the hydration retainment count of the object. There are obviously many cases when this is just not obvious or maybe not even easily accomplished, and if only one calling object fails to do so properly I encounter the exact opposite scenario I was trying to prevent in the first place. Ironic, huh?
Anyway, I'm thinking that if I proceed w/ this approach that it will just end badly. I'm tempted to go back to the original plan but doing so makes me want to cringe and I'm sure there is a more elegant solution floating around out there. As I said before, any advice would be greatly appreciated. Thanks in advance.
I'd also be aware (as I'm sure you are) that CoreData is just around the corner, and make sure you make the right choice for the future.
Have you considered implementing this via the NSCoder interface? Not sure that it wouldn't be more trouble than it's worth, but if what you want is to extract all the data out into an in-memory object graph, and save it back later, that might be appropriate. If you're actually using SQL queries to limit the amount of in-memory data, then obviously, this wouldn't be the way to do it.
I decided to go w/ Core Data after all.