Concurrent read operations on MongoDB

I have a Scala application which is accessing a MongoDB collection with 13 million records over 4 threads.
I want the four threads to access Mongo concurrently and want to make sure that they never read the same record. Also, a record accessed by thread 2 in pass 3 should not be accessed by any other thread in the future.
Any suggestion on how I could achieve this?

This looks like a good place for a dispatcher.
The dispatcher will need to read all ids and then, using, say, a round-robin queue, push ids to f1, f2, f3, f4. There is no lock mechanism that will prevent reading data from a SINGLE document, so once an id has been dispatched the underlying function will have to carry out all operations for that document.
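For illustration, a minimal sketch of that dispatcher idea in Python with pymongo (the URI, database/collection names and process_record are assumptions, not from the question): the main thread streams _ids into a thread-safe queue and four workers pull from it, so each record is handed to exactly one thread.

import queue
import threading
from pymongo import MongoClient

NUM_WORKERS = 4
SENTINEL = None  # tells a worker to stop

client = MongoClient("mongodb://localhost:27017")   # assumed URI
coll = client["mydb"]["records"]                    # assumed db/collection names

work_queue = queue.Queue(maxsize=10_000)

def process_record(doc):
    # placeholder for the real per-record work
    pass

def worker():
    while True:
        _id = work_queue.get()
        if _id is SENTINEL:
            break
        doc = coll.find_one({"_id": _id})
        if doc is not None:
            process_record(doc)

threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

# Dispatcher: stream only the _ids; each id is enqueued exactly once,
# so no two workers ever process the same record.
for doc in coll.find({}, {"_id": 1}):
    work_queue.put(doc["_id"])

for _ in range(NUM_WORKERS):
    work_queue.put(SENTINEL)
for t in threads:
    t.join()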

Related

Understanding Persistent Entities with streams of data

I want to use Lagom to build a data processing pipeline. The first step in this pipeline is a service using a Twitter client to subscribe to a stream of Twitter messages. For each new message I want to persist the message in Cassandra.
What I don't understand is: given I model my aggregate root as a List of TwitterMessages, for example, after running for some time this aggregate root will be several gigabytes in size. There is no need to store all the TwitterMessages in memory, since the goal of this one service is just to persist each incoming message and then publish the message out to Kafka for the next service to process.
How would I model my aggregate root as a persistent entity for a stream of messages without it consuming unlimited resources? Is there any example code showing this usage of Lagom?
Event sourcing is a good default go-to, but not the right solution for everything. In your case it may not be the right approach. Firstly, do you need the Tweets persisted, or is it ok to publish them directly to Kafka?
Assuming you need them persisted, aggregates should store in memory whatever they need to validate incoming commands and generate new events. From what you've described, your aggregate doesn't need any data to do that, so your aggregate would not be a list of Twitter messages; rather, it could just be NotUsed. Each time it gets a command it emits a new event for that Tweet. The thing here is, it's not really an aggregate, because you're not aggregating any state, you're just emitting events in response to commands with no invariants or anything. So you're not really using the Lagom persistent entity API for what it was made to be used for. Nevertheless, it may make sense to use it this way anyway; it's a high-level API that comes with a few useful things, including the streaming functionality. But there are also some gotchas you should be aware of: if you put all your Tweets in one entity, you limit your throughput to what one core on one node can do sequentially. So maybe you could expect to handle 20 tweets a second; if you ever expect it to be more than that, then you're using the wrong approach, and you'll need to at a minimum distribute your tweets across multiple entities.
The other approach would be to simply store the messages directly in Cassandra yourself, and then publish directly to Kafka after doing that. This would be a lot simpler, a lot less mechanics involved, and it should scale very nicely, just make sure you choose your partition key columns in Cassandra wisely - I'd probably partition by user id.
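A rough sketch of that second approach, shown here in Python with the DataStax cassandra-driver and kafka-python rather than Lagom's Scala APIs (the contact point, keyspace, table, topic and message fields are all assumptions): write the tweet to Cassandra first, then publish it downstream.

import json
from cassandra.cluster import Cluster
from kafka import KafkaProducer

cluster = Cluster(["127.0.0.1"])            # assumed contact point
session = cluster.connect("tweets_ks")      # assumed keyspace

# Assumed table partitioned by user id, as suggested above.
insert_stmt = session.prepare(
    "INSERT INTO tweets (user_id, tweet_id, body) VALUES (?, ?, ?)"
)

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def handle_tweet(tweet):
    # 1. Persist to Cassandra.
    session.execute(insert_stmt, (tweet["user_id"], tweet["tweet_id"], tweet["body"]))
    # 2. Then publish to Kafka for the next service to process.
    producer.send("tweets", tweet)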

How do I place a read lock on MongoDB?

My application needs to access a Mongo db where if more than one process/thread is reading from a specific collection, bad things will happen.
I need to restrict the ability of a group of processes to read from the collection (or db, if need be). So for example, if there are multiple processes trying to read from the db, they read sequentially, not in parallel.
This could be done at the driver level. If you set the connection pool size to 1 then all access to the database will be sequential.
In nodejs you can set the driver as:
MongoClient.connect(url, {
  poolSize: 1
});
From the documentation:
poolSize, this allows you to control how many tcp connections are opened in parallel. The default value for this is 5 but you can set it as high as you want. The driver will use a round-robin strategy to dispatch and read from the tcp connection.
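For comparison, the same idea sketched in Python with pymongo (the URI and names are assumptions): with maxPoolSize set to 1 there is only one connection in the pool, so concurrent requests from that client are effectively serialized.

from pymongo import MongoClient

# One TCP connection in the pool: requests from this client queue up
# and hit the server one at a time.
client = MongoClient("mongodb://localhost:27017", maxPoolSize=1)
coll = client["mydb"]["the_collection"]   # assumed names

doc = coll.find_one({})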

shared queue between multiple process in perl

The following scenario was done using threads:
A large queue @work_queue is populated/enqueued by the main thread. Used Thread::Queue here.
≥ 2 connection objects of something are added to @conns, which had to be loaded serially because part of the loading process uses Expect->spawn.
Multiple worker threads are invoked, and each thread is given a single $conns[$i] object & a reference to the shared \@work_queue.
Each worker thread safely removes a single item from @work_queue and performs some processing through its connection object, after which it picks up the next available item from @work_queue.
When @work_queue is empty all the threads shut down safely.
Now, the problem is that the loading phase is taking too long in many cases. But due to the use of Expect->spawn, parallel loading of @conns is possible only in a separate process, not in a thread.
Please suggest a good way to achieve the above scenario using fork. Or, even better, if there is a way to use Expect->spawn with threads. (UNIX/LINUX only)
See Is it possible to use threads with Expect?
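For the fork-based route, here is a conceptual sketch (in Python rather than Perl, ignoring the Expect specifics; names and counts are assumptions) of a shared work queue consumed by several forked worker processes, each of which loads its own connection object in parallel:

import multiprocessing as mp

def make_connection(i):
    # placeholder for the slow, Expect-style connection setup
    return f"conn-{i}"

def worker(i, work_queue):
    conn = make_connection(i)      # each process builds its connection in parallel
    while True:
        item = work_queue.get()
        if item is None:           # sentinel: queue drained, shut down safely
            break
        print(f"worker {i} processing {item} via {conn}")

if __name__ == "__main__":
    work_queue = mp.Queue()
    for item in range(100):        # populate the shared queue up front
        work_queue.put(item)
    procs = [mp.Process(target=worker, args=(i, work_queue)) for i in range(4)]
    for _ in procs:
        work_queue.put(None)       # one sentinel per worker
    for p in procs:
        p.start()
    for p in procs:
        p.join()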

Filtering Redis Hash Entries

I'm using redis to store hashes with ~100k records per hash. I want to implement filtering (faceting) the records within a given hash. Note a hash entry can belong to n filters.
After reading this and this it looks like I should:
Implement a sorted SET per filter. The values within the SET correspond to the keys within a HASH.
Retrieve the HASH keys from the given filter SET.
Once I have the HASH keys from the SET fetch the corresponding entries from the HASH. This should give me all entries that belong to the filter.
Firstly is the above approach correct at a high level?
Assuming the approach is OK the bit I'm missing is what's the most efficient implementation to retrieve the HASH entries? Am I right in thinking once I have the HASH keys I should then use a PIPELINE to queue multiple HGETALL commands passing through each HASH key? Is there a better approach?
My concern about using a PIPELINE is that I believe it will block all other clients while servicing the command. I'll be paging the filtered results with 500 results per page. With multiple browser based clients performing filtering, not to mention the back end processes that populate the SETs and HASHes it sounds like there's potential for a lot of contention if PIPELINE does block. Could anyone provide a view on this?
If it helps I'm using 2.2.4 redis, predis for the web clients and servicestack for the back end.
Thanks,
Paul
Redis is a lock-free, non-blocking, async server, so there is no added contention when using pipelining. Redis hums along happily, processing each operation as soon as it receives it, so in practice it can process multiple pipelined operations. In essence redis-server really doesn't care whether an operation is pipelined or not; it just processes each operation as it receives it.
The benefit of pipelining is to reduce client latency: instead of waiting for a response from redis-server for each operation before sending the next one, the client can just pump all operations at once in a single write and then read back all the responses in a single read.
An example of this in action is in my Redis mini StackOverflow clone: each click makes a call to ToQuestionResults(), which, because the operations are pipelined, sends all operations in 1 socket write call and reads the results in 1 blocking socket read, which is more efficient than a blocking read per call:
https://github.com/ServiceStack/ServiceStack.Examples/blob/master/src/RedisStackOverflow/RedisStackOverflow.ServiceInterface/IRepository.cs#L180
My concern about using a PIPELINE is that I believe it will block all other clients while servicing the command.
This is not a valid concern, and I wouldn't overthink how Redis works here; assume it's doing it the most efficient way, where pipelining doesn't block the processing of other clients' commands. Conceptually, you can think of redis-server as processing each command (pipelined or not) in FIFO order (i.e. no time is wasted waiting for or reading the entire pipeline).
You're describing something closer to MULTI/EXEC (i.e. Redis transactions), where all operations are done at once as soon as the Redis server reads EXEC (i.e. end of the transaction). This is not a problem either; redis-server still doesn't waste any time waiting to receive your entire transaction, it just queues the partial command set in a temporary queue until it receives the final EXEC, which is then processed all at once.
This is how Redis achieves atomicity: by processing each command, one at a time, as soon as it receives it. Since there are no other threads, there is no thread context switching, no locks and no multi-threading issues. It basically achieves concurrency by processing each command really fast.
So in this case I would use Pipelining as it's always a win, more so the more commands you pipeline (as you reduce the blocking read count).
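A small redis-py sketch of the pipelined lookup being discussed (the key names and page size are assumptions): fetch the first page of ids from the filter's sorted set, queue one HGET per id, and read all the replies back in a single round trip.

import redis

r = redis.Redis()

# Members of the filter's sorted set are field names inside the big hash (assumed layout).
ids = r.zrange("filter:some-facet", 0, 499)    # first page of 500 ids

pipe = r.pipeline(transaction=False)           # plain pipelining, no MULTI/EXEC
for _id in ids:
    pipe.hget("records", _id)                  # assumed hash key
entries = pipe.execute()                       # one write, one read for all replies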
Individual operations do block, but it doesn't matter as they shouldn't be long running. It sounds like you are retrieving more information than you really need - HGETALL will return 100,000 items when you only need 500.
Sending 500 HGET operations may work (assuming the set stores both hash and key) though it's possible that using hashes at all is a case of premature optimization - you may be better off using regular keys and MGET.
I think you misunderstand what pipelining does. It doesn't block while all the commands are being sent. All it's doing is BUFFERING the commands, then executing them all at once at the end, so they are executed as if they are one single command. At no time is blocking occurring. The same is true for redis multi/exec. The closest thing you get to blocking/locking in redis is optimistic locking by using the watch, which will cause exec to fail if the redis key has been written to since you called watch.
Even more efficient than calling hget 500 times within a pipeline block is to just call hmget('hash-key', *keys), where keys is an array of the 500 hash keys you are looking up. This will result in a single call to redis, which is the same as if it was pipelined, but should be faster to execute since you aren't looping in Ruby.
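The same single-command variant sketched with redis-py (key names assumed): one HMGET carrying all 500 field names instead of 500 pipelined HGETs.

import redis

r = redis.Redis()

ids = r.zrange("filter:some-facet", 0, 499)    # assumed filter set, first page of 500
entries = r.hmget("records", ids)              # one round trip, one command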

MongoDB Schema Design - Real-time Chat

I'm starting a project which I think will be particularly suited to MongoDB due to the speed and scalability it affords.
The module I'm currently interested in is to do with real-time chat. If I was to do this in a traditional RDBMS I'd split it out into:
Channel (A channel has many users)
User (A user has one channel but many messages)
Message (A message has a user)
For the purpose of this use case, I'd like to assume that there will typically be 5 channels active at one time, each handling at most 5 messages per second.
Specific queries that need to be fast:
Fetch new messages (based on a bookmark: a timestamp maybe, or an incrementing counter?)
Post a message to a channel
Verify that a user can post in a channel
Bearing in mind that the document size limit with MongoDB is 4 MB, how would you go about designing the schema? What would yours look like? Are there any gotchas I should watch out for?
I used Redis, NGINX & PHP-FPM for my chat project. Not super elegant, but it does the trick. There are a few pieces to the puzzle.
There is a very simple PHP script that receives client commands and puts them in one massive LIST. It also checks all room LISTs and the user's private LIST to see if there are messages it must deliver. This is polled by a client written in jQuery, and it's done every few seconds.
There is a command-line PHP script that operates server side in an infinite loop, 20 times per second, which checks this list and then processes these commands. The script handles who is in what room and permissions in the script's memory; this info is not stored in Redis.
Redis has a LIST for each room & a LIST for each user which operates as a private queue. It also has multiple counters for each room the user is in. If the user's counter is less than the total number of messages in the room, then it gets the difference and sends it to the user.
I haven't been able to stress test this solution, but at least from my basic benchmarking it could probably handle many thousands of messages per second. There is also the opportunity to port this over to something like Node.js to increase performance. Redis is also maturing and has some interesting features like Pub/Sub commands, which might be of interest and could possibly remove the polling on the server side.
I looked into Comet-based solutions, but many of them were complicated, poorly documented or would require me to learn an entirely new language (e.g. Jetty -> Java, APE -> C), etc... Also, delivery and going through proxies can sometimes be an issue with Comet. So that is why I've stuck with polling.
I imagine you could do something similar with MongoDB. A collection per room, a collection per user & then a collection which maintains counters. You'll still need to write a back-end daemon or script to handle managing where these messages go. You could also use MongoDB's capped collections, which keep the documents sorted & automatically clear old messages out, but that could make maintaining proper counters complicated.
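A toy redis-py sketch of the LIST-plus-counter delivery described above (key names and message format are assumptions, not from the answer): messages are appended to a room list, and each user tracks how many they have already seen.

import redis

r = redis.Redis()

def post_message(room, text):
    # append to the room's message list
    r.rpush(f"room:{room}:messages", text)

def fetch_new_messages(room, user):
    seen_key = f"user:{user}:seen:{room}"
    seen = int(r.get(seen_key) or 0)
    total = r.llen(f"room:{room}:messages")
    if total <= seen:
        return []
    messages = r.lrange(f"room:{room}:messages", seen, total - 1)
    r.set(seen_key, total)          # advance the user's counter
    return messages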
Why use Mongo for a messaging system? No matter how fast the static store is (and Mongo is very fast), whether it's Mongo or another db, to mimic a message queue you're going to have to use some kind of polling, which is not very scalable or efficient. Granted, you're not doing anything terribly intense, but why not just use the right tool for the job? Use a messaging system like Rabbit or ActiveMQ.
If you must use mongo (maybe you just want to play around with it and this project is a good chance to do that?) I imagine you'll have a collection for users (where each user object has a list of the queues that user listens to). For messages, you could have a collection for each queue, but then you'd have to poll each queue you're interested in for messages. Better would be to have a single collection as a queue, as it's easy in mongo to do "in" queries on a single collection, so it'd be easy to do things like "get all messages newer than X in any queues where queue.name in list [a,b,c]".
You might also consider setting up your collection as a mongo capped collection, which just means that you tell mongo when you set up the collection that your collection should only hold X number of bytes, or X number of items. Adding additional items has First-In, First-Out behavior which is pretty much ideal for a message queue. But again, it's not really a messaging system.
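A brief pymongo sketch of that capped-collection-as-queue idea (sizes, field names and the incrementing counter are assumptions):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
db = client["chat"]                               # assumed db name

# Capped collection: fixed size, insertion order preserved, oldest docs aged out.
if "messages" not in db.list_collection_names():
    db.create_collection("messages", capped=True, size=10 * 1024 * 1024)

msgs = db["messages"]
msgs.insert_one({"queue": "room-a", "counter": 42, "user": "paul", "text": "hi"})

# "Get all messages newer than X in any of the queues I listen to."
new_msgs = msgs.find(
    {"queue": {"$in": ["room-a", "room-b"]}, "counter": {"$gt": 41}}
).sort("counter", ASCENDING)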
1) ape-project.org
2) http://code.google.com/p/redis/
3) After you're through all this, you can dump data into MongoDB for logging and store consistent data (users, channels) there as well.