Distributed Lucene.net search (sharding?) with Akka.Cluster?

We are building a web application distributed over several nodes in a cluster, and I want to explore whether we can do content search (not full-text search, but data queries) using Lucene.net.
From what I can see, constructing an Akka.net-based cluster of actors for indexing/searching may not be so difficult ... but achieving some of the functionality found in Elasticsearch would also be nice, particularly moving shards between nodes, replicating shards depending on topology ... etc.
If we post an "index this content" message to one of the index/search nodes and that node goes down, then that cached index is lost. On the other hand, if one node receives the message while another node that has already indexed the content comes back online, the content ends up indexed twice.
So the Lucene.net indices need to be curated continuously, I think. But how?
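To make the routing idea concrete, here is a rough sketch of what I am imagining (plain Python pseudocode with hypothetical names, not real Akka.net or Lucene.net API): every node maps a document to a fixed logical shard and forwards the message to the same small set of owner nodes, so an index message neither lives only on the node that happened to receive it nor gets duplicated arbitrarily.

```python
# Hypothetical sketch only: deterministic shard ownership so any node routes an
# "index this content" message to the same owners, whichever node received it first.
import hashlib

NUM_SHARDS = 16          # fixed number of logical shards, independent of node count
REPLICATION_FACTOR = 2   # each shard's Lucene index is kept on two nodes

def shard_for(doc_id):
    """Map a document id to a logical shard."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def owners_for(shard, nodes):
    """Pick REPLICATION_FACTOR nodes from the current cluster view for this shard."""
    ordered = sorted(nodes)          # every node must see the same ordering
    start = shard % len(ordered)
    return [ordered[(start + i) % len(ordered)] for i in range(REPLICATION_FACTOR)]

# Any node can compute the same answer, so indexing a document on the owners is
# intentional replication rather than accidental duplication.
nodes = ["nodeA", "nodeB", "nodeC"]
shard = shard_for("content-42")
print(shard, owners_for(shard, nodes))
```

But that still leaves the question of how the owners recover or hand off a shard's index when cluster membership changes.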

Related

How to throttle MongoDB write operations on one shard

We have a sharded MongoDB Cluster. One of our biggest collections is sharded based on the document id.
On some occasions, the interactions on one document increase a lot, and because of this we get the MongoDBLockQueueBacklogging error, which brings the whole cluster down.
We don't really need those interactions to be persisted in real time; we can throttle them for some milliseconds.
I was reading about the Kafka-MongoDB connector, and it sounds like a very good idea. The problem is that we cannot use update operators ($inc, $currentDate, etc.) with it.
I have a feeling that something already exists for this kind of problem, but I cannot find it (or I am not searching for the right terms).
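To illustrate what I mean by throttling, here is a hypothetical sketch (pymongo, made-up collection and field names, not code we actually run) that coalesces the $inc updates in memory and flushes them as a single bulk write every few milliseconds:

```python
# Hypothetical sketch: coalesce per-document counter updates before hitting the shard.
from collections import defaultdict
from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://mongos:27017")
coll = client["mydb"]["interactions"]

pending = defaultdict(int)          # document id -> accumulated increment

def record_interaction(doc_id):
    pending[doc_id] += 1            # in-memory only, no DB round trip yet

def flush():
    if not pending:
        return
    ops = [UpdateOne({"_id": doc_id}, {"$inc": {"interactions": n}})
           for doc_id, n in pending.items()]
    pending.clear()
    coll.bulk_write(ops, ordered=False)   # one batched write per flush interval

# A background timer would call flush() every ~50 ms:
#   while True: time.sleep(0.05); flush()
```

Is there an existing tool or pattern that does this kind of write coalescing for me?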

MongoDB - Can I shard all new DBs (created by the application) automatically?

My team will deploy a new version of our app (it captures social media posts, hashtags, etc.). It creates a different DB for each user, and we may have thousands of collections in each DB. I have read all of the MongoDB sharding documentation, and I saw that I can only shard one collection or one DB at a time. Am I missing something?
We will start this new version fresh, without any databases, and grow from zero again (for now we have 23k users), but this number will escalate really quickly (100,000+ by the end of the year).
My question is: do I really need a sharded cluster? (My test setup has 3 shards with 3 micro-shards, 3 config servers, and 2 mongos.) For now, in production, I have one large server doing all the hard work, but I don't want to keep scaling up; horizontal scaling is the better choice, I think.
Can I shard all my databases automatically, or do I really need to do it one by one, going through the shard-key procedure for each?
Thanks in advance
You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support it. If you really want to go with this WTF solution, your application will be responsible for setting up sharding for each collection it creates. This forces you to give administrative permissions to the application (despite what every security guide recommends).
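To make that concrete, the application would have to run roughly the following against a mongos for every collection it creates (a pymongo sketch with placeholder names), and both commands require administrative privileges:

```python
# Sketch of what "the application sets up sharding itself" would mean.
# Placeholder names; must run against a mongos with admin rights.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos:27017")

def create_user_collection(db_name, coll_name):
    client.admin.command("enableSharding", db_name)                  # admin-only
    client.admin.command("shardCollection", f"{db_name}.{coll_name}",
                         key={"_id": "hashed"})                      # admin-only
    return client[db_name][coll_name]
```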
"Will you really need a sharded cluster" - that depends on how much data you will have and how often you query it with what kind of query. But it is unlikely to work anyway, because your sharded cluster will have to manage (100,000 databases* 1.000 collections) = a hundred million collections. MongoDB is not designed for scaling in that direction. The cluster will likely be so busy with bookkeeping that you won't really see any notable performance gain.
It is also questionable if clustering would even theoretically make sense. Clustering is usually only useful when you have very large collections. But in your scenario where your data is so heavily fragmented into a million collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would, with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on the user and whatever criteria you currently use to tell which database a document belongs to. This index might also be a good shard key.
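As a rough sketch (field names are only examples):

```python
# One collection for all users, with a compound index that can later double as
# the shard key. Field names are examples only.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
posts = client["social"]["posts"]

# Every lookup is scoped by user, so the user comes first in the compound index.
posts.create_index([("user_id", ASCENDING), ("post_type", ASCENDING)])

# If sharding ever becomes necessary, the same key works as the shard key:
#   sh.shardCollection("social.posts", { user_id: 1, post_type: 1 })
```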

Configure a Mongo replica set to only replicate certain collections

I have a ~3GB mongo database with several dozen collections. Three of these collections handle ~300 queries per second, while the rest sustain a much lower volume. I expect the traffic to continue to grow quickly.
I'd like to set up a replica set to handle the high-traffic collections. It isn't necessary for this new instance to replicate the rest of the database. Is this possible?
It seems this is not possible at the moment with MongoDB's built-in features; the only way to do it is to come up with your own manual replication algorithm or use some third-party tool.
The https://github.com/wordnik/wordnik-oss project might help you achieve this, according to the following post:
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Ap9V4ArGuFo
It describes a workaround for filtering documents during replication:
Replicate only documents where {'public':true} in MongoDB
Or just replicate the data yourself manually, which might be worth trying.
Good luck.
No, that isn't possible right now. What you could do is move the other (low-traffic) collections into a separate, unreplicated database. But this will cause headaches once those collections see higher traffic too, because then you would need to move them back into your "replicated" DB.
But in general, replication isn't the way to go if you need to scale; it is intended more for DR/failover. Replica-set secondaries can only (optionally) answer read queries, never write queries, which is something you should keep in mind. So if you have a high write load, this may not cure your problem.
Once you allow your application to read from secondaries, you have to live with eventual consistency, meaning that your application isn't guaranteed to always see the latest data. This is because replication to the secondaries is asynchronous.
You can work around this by configuring the write concern so that a write has to succeed on all replicas before it is considered written and your driver returns. But this may slow down your write operations significantly.
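For illustration, with a driver such as pymongo the two knobs look roughly like this (connection string and names are placeholders):

```python
# Reading from secondaries vs. writing with a stricter write concern.
# Connection string, database and collection names are placeholders.
from pymongo import MongoClient, ReadPreference, WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# Reads that may be slightly stale but offload work from the primary:
stale_ok = client.get_database(
    "mydb", read_preference=ReadPreference.SECONDARY_PREFERRED)["hot_collection"]

# Writes that only return once all three members have acknowledged them
# (w="majority" is the more common, slightly weaker setting):
safe_writes = client.get_database(
    "mydb", write_concern=WriteConcern(w=3))["hot_collection"]
```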
So for scaling query-execution capacity, I would go with sharding. This is done on a per-collection level; all unsharded collections remain on the database's primary ("default") shard.
It's not possible, but if the data size is that small and those collections are rarely updated, then the only overhead of having them replicated is the small amount of storage used on the secondary. That is a relatively small price to pay, especially since the collections won't grow much in size, compared with writing your own replication logic.
Instead of that, archive the data: keep only the latest data set on the production server, and archive the rest on the new server.

Tag aware sharding in Mongodb

I have been reading about tag-aware sharding. These are the links I referred to:
http://www.mongodb.org/display/DOCS/Tag+Aware+Sharding
http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
Kristina has explained the concept in a very lucid manner and one thing is for sure: this enhancement is going to make MongoDB more developer-friendly.
But my question is: it looks like tagging/retagging is meant to easily migrate chunks around, direct all writes to a preferred data center, etc. But how does this fit into the old system of range partitioning and the way Mongo learns key distributions for balancing? It is said that the shard key cannot be changed, because the data is assumed to already be distributed across shards and changing the shard key would disturb this. Isn't applying a tag essentially doing the same thing? So is tag-aware sharding meant to handle this problem?
EDIT:
And any idea how the indexes are affected by such huge migrations?
Aafreen,
You are correct. At this stage shard-tagging performs many of the same functions as balancing with the shard key. The one thing it does not do is perform any level of distribution beyond that of tagging. So it is probably more correct to say that the tagging architecture lives on top of the existing sharding architecture.
You must keep in mind that tagging only governs:
a) where tagged data will go; untagged data is still placed using the shard key
b) that tagged data shared amongst a number of tagged servers still needs to be distributed among them
You can most certainly use tag-aware sharding to manually control data distribution in the same manner the balancer does now, by making the tags granular enough that data is put where you want it and distributed evenly.
The intended use case, however, is more like the documentation you linked, where a large number of shards is broken up into smaller subsets. In that example you would tag each object, the tag would push it to the correct geographic location (for lower-latency retrieval), and once within the correct geography the original sharding architecture would take over and distribute the data amongst the tagged shards.
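For illustration, the tagging itself is only a couple of admin commands. A rough pymongo sketch using the newer "zone" commands (the successors of the tag helpers), where the shard names and the compound shard key {region: 1, user_id: 1} are assumptions:

```python
# Tag-aware (zone) sharding sketch: pin a key range to a group of shards.
# Shard names and fields are examples; assumes shard key {region: 1, user_id: 1}.
from pymongo import MongoClient
from bson.min_key import MinKey
from bson.max_key import MaxKey

client = MongoClient("mongodb://mongos:27017")

# Tag the shards that live in the EU data center.
client.admin.command("addShardToZone", "shardEU1", zone="EU")
client.admin.command("addShardToZone", "shardEU2", zone="EU")

# Pin all documents whose region is "EU" to those shards; inside that range the
# ordinary shard-key balancing still spreads chunks across shardEU1/shardEU2.
client.admin.command(
    "updateZoneKeyRange", "mydb.users",
    min={"region": "EU", "user_id": MinKey()},
    max={"region": "EU", "user_id": MaxKey()},
    zone="EU",
)
```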
As for indexes, they are heavily affected by migrations, as they need to be repointed. But the level of load is the same as for any large number of chunk migrations - like adding a new shard to a cluster.

Scaling MongoDB on EC2 or should I just switch to DynamoDB?

I currently run my website on a single server with MongoDB. On my server I have two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user-personalization DB. I am moving to Amazon EC2 for auto-scaling, so that the web server can scale out and I can increase the number of servers as web traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize for:
Minimal changes to my code (the code is in Perl)
Ability to seamlessly add/remove web servers without worrying about losing data in the DB
Low cost
In the short term, the DB will certainly be able to fit in memory across all machines, since it will be under 2 GB. The user-personalization DB can't be rebuilt, so it is more important to preserve it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options:
Use SimpleDB for the user-personalization store and MongoDB for the index, and have the index replicate across all machines. However, I don't know too much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know too much about SimpleDB and/or DynamoDB. Based on articles, it seems like DynamoDB would be the natural choice, but I'm not sure about good Perl support, whether I can keep all my columns, indexes, etc. Does anyone have experience or any advice?
You could host Mongo on a single EC2 server that each of the boxes in the web farm connects to. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers, as we run a replica set, and when we get to the point where we need to scale horizontally with Mongo we'll spin up some new instances and shard the larger collections.
I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
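For reference, initiating a minimal three-member replica set is roughly this (host names are placeholders):

```python
# Minimal sketch of initiating a three-member replica set; host names are placeholders.
from pymongo import MongoClient

# Connect directly to one of the not-yet-initiated members.
client = MongoClient("mongodb://host1:27017", directConnection=True)

client.admin.command("replSetInitiate", {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "host1:27017"},
        {"_id": 1, "host": "host2:27017"},
        {"_id": 2, "host": "host3:27017"},
    ],
})
```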
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read the sharding documentation very carefully and recognize the limitations. For MongoDB sharding, you have to select the correct key, one that allows queries to be evenly distributed across the shards.
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
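You can see the difference yourself with explain(). A rough sketch (collection and field names are placeholders, assuming the collection is sharded on user_id):

```python
# Targeted vs. scatter-gather queries on a sharded collection (run via mongos).
# Names are placeholders; assumes the collection is sharded on {"user_id": 1}.
from pymongo import MongoClient

client = MongoClient("mongodb://mongos:27017")
profiles = client["dating"]["profiles"]

targeted = profiles.find({"user_id": 12345}).explain()       # routed to one shard
broadcast = profiles.find({"eye_color": "green"}).explain()   # hits every shard

print(targeted["queryPlanner"]["winningPlan"]["stage"])   # typically SINGLE_SHARD
print(broadcast["queryPlanner"]["winningPlan"]["stage"])  # typically SHARD_MERGE
```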
Beware that at the moment EC2 does not have 64-bit small instances, which makes replication potentially expensive. Because MongoDB memory-maps its files, a 32-bit OS is not advised.
I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
There is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 will be faster than rewriting for / migrating to DynamoDB.
Best of luck!