I've found myself needing to support result grouping with an accurate ngroups count, which requires colocating documents by a secondaryId field.
I'm currently indexing documents using the compositeId router in Solr. The uniqueKey is documentId, and I'm adding a shard key at the front like this:
doc.addField("documentId", secondaryId + "!" + actualDocId);
The problem I'm seeing is that the document count across my 3 shards is now uneven:
shard1: ~30k
shard2: ~60k
shard3: ~30k
(This is expected to grow a lot.)
Apparently the hashes of secondaryId are not very evenly distributed, but I don't know enough about possible values.
Any thoughts on getting a better distribution of these documents?
Your data is not evenly spread across your secondaryIds: some secondary ids have a lot more data than others. There is no perfect and/or simple solution.
Assuming you cannot change your routing id, one approach is to create a larger number of shards, say 16 on the same number of hosts. Your shards will now be smaller and still potentially uneven, but given their larger number you can then move shards around across your nodes to more or less balance the nodes out in size.
The caveat is that this assumes you have routed queries, so that each query hits only one shard. If you have unrouted queries, having a large number of shards can result in significant performance degradation, as each query will need to be run against every shard.
What I've done is read the Solr routing code to see how it hashes, then replicated some of the logic manually to figure out the hash ranges to split on.
I found these online tools to convert the IDs to a hash and then back and forth to hex, which is what the shard split command wants.
Murmur hash app: http://murmurhash.shorelabs.com/
Use “MurmurHash3” form.
Hex converter app: https://www.rapidtables.com/convert/number/decimal-to-hex.html
I think you want “Hex signed 2's complement” when it differs, but not when the value has a 00000000 prefix...
You'll also have to pay attention to masking. It's something like:
Imagine you have a document hashed to a hex value of 12345678. This is a composite of:
primaryRouteId: 12xxxxxx
secondaryRouteId: xx34xxxx
documentId: xxxx5678
(Note that if you only have a primaryRouteId!docId, then the primaryRouteId takes the first 4 spots.)
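To make the masking concrete, here's a rough Java sketch of how I understand the two-level compositeId math, using the murmur3 helper that ships with SolrJ. The ids are made up, and you should double-check against the CompositeIdRouter source for your Solr version before feeding the resulting hex ranges to SPLITSHARD:

import org.apache.solr.common.util.Hash;

// Sketch: replicates how I understand CompositeIdRouter to hash a two-level key
// like "secondaryId!actualDocId" -- top 16 bits from the route prefix, bottom 16
// bits from the doc id -- so every doc sharing a prefix lands in one 16-bit-wide range.
public class CompositeHashSketch {

    static int hash(String s) {
        // Same murmur3 variant (x86, 32-bit, seed 0) that the Solr routers use.
        return Hash.murmurhash3_x86_32(s, 0, s.length(), 0);
    }

    public static void main(String[] args) {
        String routePrefix = "someSecondaryId";   // made-up example values
        String docId = "someDocId";

        int composite = (hash(routePrefix) & 0xFFFF0000) | (hash(docId) & 0x0000FFFF);
        System.out.printf("composite hash = %08x%n", composite);

        // The hash range this prefix can occupy -- handy when working out the hex
        // ranges to pass to the SPLITSHARD command's 'ranges' parameter.
        int min = hash(routePrefix) & 0xFFFF0000;
        int max = min | 0x0000FFFF;
        System.out.printf("prefix range   = %08x-%08x%n", min, max);
    }
}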
You can use Solr rebalancing with the feature called UTILIZENODE.
Check these links:
https://solr.apache.org/guide/8_4/cluster-node-management.html#utilizenode
https://solr.pl/en/2018/01/02/solr-7-2-rebalancing-replicas-using-utilizenode/
It will automatically handle the uneven shards and will balance them across all the servers.
Note: it is a relatively new feature and works only with Solr versions greater than or equal to 7.2.
Related
We have a large MongoDB collection that we'd like to start sharding. The collection has 3.4B records and is ~14.6TB in size (5.3TB compressed on disk). This collection typically sees writes on the order of ~5M per hour, but we expect this to continue to grow year over year. The indexes on this collection are ~220GB in size.
All records have a feedId and all queries will be for records belonging to a specific feedId. There are currently ~200 unique feedId values, but the distribution across each value is highly non-linear. On the low end, some feedId's may only see dozens of records per day. On the other hand, the top 5 feedId's make up ~75% of the dataset.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
There is already an existing compound index on feedId and timestamp.
The typical working set for this collection is only the last few weeks' worth of data, and is therefore only a very small percentage of the actual data. Queries for this data must be very fast, with slower queries for the historical data being acceptable. As such, we're planning to use "tags" and/or "zones" to move older data to nodes with larger HDD's and use nodes with SSD's for the "hot" data.
Based on these factors, is using a shard key of {feedId: 1, timestamp: 1} reasonable? My feeling is that it may lead to "hot" nodes due to the non-linearity of feedId and the monotonic nature of timestamp. Would adding a "hashed" field to the key make it better/worse?
So let's take this bit by bit!
The collection has 3.4B records and is ~14.6TB in size (5.3TB compressed on disk)
The nature of sharding is such that it's important to get this right the first time through. I'm going to go into more detail here, but the TL;DR is:
Extract a portion of your dataset (e.g. using mongodump --query) to a staging cluster (e.g. using mongorestore)
Point a sample workload at the staging cluster to emulate your production environment
Test one or more shard key combinations. Dump/reload as needed until you're satisfied with performance.
Now, let's dig in:
There are currently ~200 unique feedId values, but the distribution across each value is highly non-linear. On the low end, some feedId's may only see dozens of records per day. On the other hand, the top 5 feedId's make up ~75% of the dataset.
So one field that supports a good chunk of your queries has pretty low frequency. You are definitely likely to see hotspotting if you shard on this field alone.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
So another field that supports the majority of your queries, but it's also not great for sharding on its own.
Records also have a timestamp and queries will always be for a given date range. The timestamp field is more-or-less monotonic.
This to me kind of implies that the primary field you're querying against is time based: for a given period of time, give me the documents with the specified feedId. You're also going to get targeted queries, because you're querying on the shard key more often than not (e.g. either on a range of time, or a range of time + the feedId).
This also supports your idea for zoning:
As such, we're planning to use "tags" and/or "zones" to move older data to nodes with larger HDD's and use nodes with SSD's for the "hot" data.
With zoning, you can use any key in the shard key, as long as you include the entire prefix leading up to that key. So with { feedId: 1, timestamp: 1 } your zone ranges would primarily be defined on feedId (and only then on timestamp within a feedId), which isn't quite what you are looking for.
Based on that alone, I would venture that { timestamp : 1, feedId : 1 } would be a good selection. What your testing would need to look into is whether adding a low-frequency field to a monotonically increasing field provides good chunk distribution.
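If it helps, here's a rough sketch with the Java driver of what that setup could look like. The namespace, shard name, zone name and the three-week cutoff are all made up, and the zone commands assume MongoDB 3.4+:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;
import org.bson.types.MaxKey;
import org.bson.types.MinKey;
import java.util.Date;

// Sketch only: shard on { timestamp: 1, feedId: 1 } and zone a recent time window
// onto the SSD-backed shard(s). All names here are placeholders.
public class ZoningSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoDatabase admin = client.getDatabase("admin");

            // Enable sharding and shard the collection on the compound key
            // (an index on the key must already exist for a non-empty collection).
            admin.runCommand(new Document("enableSharding", "mydb"));
            admin.runCommand(new Document("shardCollection", "mydb.records")
                    .append("key", new Document("timestamp", 1).append("feedId", 1)));

            // Tag the SSD-backed shard as the "hot" zone.
            admin.runCommand(new Document("addShardToZone", "shardSSD1").append("zone", "hot"));

            // Route everything newer than ~3 weeks to the hot zone; this range would be
            // moved forward periodically by some housekeeping job.
            Date cutoff = new Date(System.currentTimeMillis() - 21L * 24 * 60 * 60 * 1000);
            admin.runCommand(new Document("updateZoneKeyRange", "mydb.records")
                    .append("min", new Document("timestamp", cutoff).append("feedId", new MinKey()))
                    .append("max", new Document("timestamp", new MaxKey()).append("feedId", new MaxKey()))
                    .append("zone", "hot"));
        }
    }
}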
Now, as far as hashing:
Would adding a "hashed" field to the key make it better/worse?
If you mean your documents already have some hashed field, then you could definitely add that just for randomness. But if you're talking about a hashed shard key, that's a different story.
Zones and hashed shard keys don't play together. The nature of a hashed shard key means that the chunk ranges (and therefore zones) represent the hashed shard key values. So even if you have two documents with values that are very near to each other, they are likely to end up on completely different chunks. Creating a zone on a range of hashed shard key values therefore probably won't do what you want it to do. You could do something like using zones with hashed sharding to move the entire collection onto a subset of shards in the cluster, but that's not what you want to do here.
Now there is one critical issue you might run into - you have a huge collection. Your choice of shard key might cause issues for the initial split where MongoDB attempts to divide your data into chunks. Please take a look at the following section in our documentation: Sharding an Existing Collection. There is a formula there for you to use to estimate the max collection size your shard key can support with the configured chunk size (64MB by default). I'm going to guess that you'll need to increase your chunk size to 128MB or possibly 256MB initially. This is only required for the initial sharding procedure. Afterwards you can reduce chunk size back to defaults and let MongoDB handle the rest.
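For reference, bumping the chunk size from code looks roughly like this (the host name is a placeholder; on older versions this is just an update to the config.settings document, which you can equally do from the mongo shell):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

// Sketch only: raise the cluster-wide chunk size to 128MB for the initial sharding
// pass by editing the config.settings document through a mongos.
public class ChunkSizeSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoCollection<Document> settings =
                    client.getDatabase("config").getCollection("settings");
            settings.updateOne(
                    Filters.eq("_id", "chunksize"),
                    Updates.set("value", 128),                // value is in MB
                    new UpdateOptions().upsert(true));
            // After the initial sharding/splitting is done, set it back to 64 the same way.
        }
    }
}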
Mind you, this is going to have a performance impact. You'll have chunks migrating across shards, plus the overhead for the actual chunk split. I would recommend you post to our Google Group for more specific guidance here.
I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day; I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
Option 1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
Option 2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows.
I am favoring option 1, but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have documents that each contain 128 measurements plus startDate and nextDate fields. This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
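For concreteness, one bucket document in that scheme looks roughly like this (field names are just what I'm using for illustration):

import org.bson.Document;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

// Illustrative shape of one "bucket" document: 128 measurements plus
// startDate/nextDate, as described above. Field names are placeholders.
public class BucketShape {
    public static void main(String[] args) {
        List<Document> measurements = new ArrayList<>();
        for (int i = 0; i < 128; i++) {
            // irregularly spaced timestamps are fine -- each point keeps its own time
            measurements.add(new Document("t", new Date()).append("v", Math.random()));
        }
        Document bucket = new Document("sensorId", "s42")                 // hypothetical sensor id
                .append("startDate", measurements.get(0).getDate("t"))    // first point in this bucket
                .append("nextDate", new Date())                           // first point of the next bucket
                .append("measurements", measurements);
        System.out.println(bucket.toJson());
    }
}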
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up reads. I currently have about 20,000 collections in my DB, and when I query the list of all collections it takes ages, which makes me think it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collections across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date (see the sketch below).
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
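For example, approach 2 with made-up names might be set up roughly like this, with the shard key matching the (date, sensor) read pattern:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// Sketch only: one collection for all sensors, sharded on { sensorId: 1, startDate: 1 }
// so a "sensor X on day D" query is routed to a single shard by mongos.
// Database, collection and field names are illustrative.
public class SensorShardingSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoDatabase admin = client.getDatabase("admin");
            admin.runCommand(new Document("enableSharding", "sensors"));
            admin.runCommand(new Document("shardCollection", "sensors.readings")
                    .append("key", new Document("sensorId", 1).append("startDate", 1)));
        }
    }
}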
Whilst MongoDB has no limit on collections, I tried an approach similar to option 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time t. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
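As a purely illustrative sketch of the "only store changes" idea (thresholds and types would depend on the sensor):

import java.util.ArrayList;
import java.util.List;

// Illustrative only: keep a reading solely when the value actually changes.
// Suits stepped sensors like a thermostat set-point; continuous sensors would
// need a tolerance or the co-linear-midpoint check mentioned above instead.
public class ChangeOnlyCompressor {
    public record Reading(long timestampMillis, double value) {}

    public static List<Reading> compress(List<Reading> raw) {
        List<Reading> kept = new ArrayList<>();
        for (Reading r : raw) {
            if (kept.isEmpty() || kept.get(kept.size() - 1).value() != r.value()) {
                kept.add(r);    // first point, or the value changed -> keep it
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Reading> raw = List.of(
                new Reading(0, 21.0), new Reading(1000, 21.0),
                new Reading(2000, 21.5), new Reading(3000, 21.5));
        System.out.println(compress(raw));   // keeps only the points at t=0 and t=2000
    }
}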
When I query the data I'm only interested in getting data from a given sensor on a given day. I don't do any cross-sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of my projects we were stuck with legacy MongoDB in a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, spreading data over multiple MongoDB collections, changing replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded under the unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in data retrieval performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always run a Spark (map-reduce) job.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe keeping a legacy technology is fine as long as it suits the job well, but if not, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM each day you will have new collections. I guess MongoDB is the wrong choice for this: there is no way to query documents across collections in MongoDB, so you would have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017 and do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you get optimal performance that does not degrade over time.
Approach #1 is not a good idea; the key to speeding things up is divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place each signal in one collection and shard the signals over nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will help
Usually signal processing works over a time span, e.g. processing a signal for 3 days; in that case you can read that signal from 3 nodes in parallel and do parallel Apache Spark processing.
Cross-signal processing: most signal processing algorithms use the same period of 2 or more signals for analysis (such as cross-correlation), and as these signals are fetched in parallel that will also be fast, and pre-processing of individual signals can be parallelized.
I'm working on a multi-tenant application running on MongoDB. Each tenant can create multiple applications. The schema for most of the collections references other collections via ObjectIds. I'm thinking of manually creating a shard key with every record insertion in the following format:
(v3 murmurhash of the record's ObjectId) + (app_id.toHexString())
Is this good enough to ensure that records for any particular application will likely end up on the same shard?
Also, what happens if a particular application grows super large compared to all others on the shard?
If you use a hash-based shard key with the input constantly changing (an ObjectId can generally be considered unique for each record), then you will get no locality of data on shards at all (except by coincidence), though it will give you great write throughput by randomly distributing writes across all shards. That's basically the trade-off with this kind of approach; the same is true of the built-in hashed sharding, and those trade-offs don't change just because it is a manual hash constructed from two fields.
Basically, because MongoDB uses range-based chunks to split up the data for a given shard key, you will have sequential ranges of hashes used as chunks in this case. Assuming your hash is not buggy in some way, the data in a single sequential range will basically be random. Hence, even within a single chunk you will have no data locality, let alone on a shard; it will be completely random (by design).
If you want applications grouped together in ranges, and hence more likely to be on a particular shard, then you would be better off prepending the app_id to make it the leftmost field in a compound shard key. Based on the limited description, sharding on something like the following would be a good start:
{app_id : 1, _id : 1}
The ObjectId is monotonically increasing over time (more discussion on that here), but if there are a decent number of application IDs and you are going to be doing range-based or targeted queries on the ObjectId, then it might still work well. You may also want to include other fields based on your query pattern.
Remember that whatever your most common query pattern is, you ideally want the shard key to satisfy it if at all possible. It has to be indexed, and it has to be usable by the mongos to route the query (if not, the query is scatter/gather), so if you are going to constantly query on app_id and _id, then the above shard key makes a lot of sense.
If you go with the manual hashed key approach, not only will you have a random distribution, but unless you are going to be querying on that hash, it's not going to be very useful.
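To make that concrete, here's a rough sketch (namespace and ids are placeholders) of sharding on { app_id : 1, _id : 1 } and issuing a query that mongos can target:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import org.bson.types.ObjectId;

// Sketch only: a compound shard key with app_id leftmost keeps a tenant's documents
// in contiguous chunk ranges, and queries that include app_id can be routed by mongos
// instead of being scatter/gathered. All names are placeholders.
public class TenantShardingSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoDatabase admin = client.getDatabase("admin");
            admin.runCommand(new Document("enableSharding", "saas"));
            admin.runCommand(new Document("shardCollection", "saas.events")
                    .append("key", new Document("app_id", 1).append("_id", 1)));

            // Targeted read: the filter includes the leftmost shard key field.
            MongoCollection<Document> events = client.getDatabase("saas").getCollection("events");
            ObjectId someAppId = new ObjectId();                   // placeholder app id
            events.find(Filters.and(
                    Filters.eq("app_id", someAppId),
                    Filters.gte("_id", new ObjectId())))           // optional range on _id
                  .first();
        }
    }
}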
I've been thinking about selecting the best shard key (through a compound index) for my data, and thought that the document creation date combined with a customer no. (or invoice no.) would be a good combination, if MongoDB would consider the customer no. as a reversed string, i.e.:
90043 => 34009
90044 => 44009
90045 => 54009
etc.
An index on the creation date would ensure that relatively new data is kept in memory, and the reversed customer no. would help MongoDB distribute the data/load across the cluster.
Is this a correct assumption? And if so, would I need to save my customer no. reversed for it to be distributed the way I expect?
Regarding your specific question of "would I need to save my customer no reversed for it to be distributed the way I expect?", no - you would not.
Even with the relatively narrow spread of customer number values you listed, if you use customerNumber in your compound key, MongoDB will break apart the data into chunks and distribute these accordingly. As long as the data associated with customerNumber are relatively evenly distributed (e.g., one user doesn't dominate the system), you will get the shard balancing you desire.
I would consider either your original choice (minus the string reversal) or Dan's choice (using the built-in ObjectId instead of timestamp) as good candidates for your compound key.
From what I have read in the documentation, the MongoDB ObjectId is already time based.
Therefore you can add the _id to your compound key like this: (_id, customerid). If you don't need the date in your application, you can just drop that field, which would save you some storage.
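For example, just to illustrate that the _id already carries the creation time:

import org.bson.types.ObjectId;
import java.util.Date;

// The first 4 bytes of an ObjectId are a creation timestamp (second resolution),
// so the document's creation date can be recovered from _id without a separate field.
public class ObjectIdTimestamp {
    public static void main(String[] args) {
        ObjectId id = new ObjectId();
        Date createdAt = id.getDate();          // timestamp embedded in the _id
        System.out.println(id + " was generated at " + createdAt);
    }
}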
MongoDB keeps recently used data in memory, and it will always try to keep a collection's indexes in RAM. When an index is too large to fit into RAM, MongoDB must read the index from disk, which is a much slower operation than reading from RAM. Keep in mind that an index fits into RAM only when your server has RAM available for the index combined with the rest of the working set.
Hope this helps.
Cheers dan
I think the issue with your thinking is that, somehow, you feel Node 1 would be faster than Node 2. Unless the hardware is drastically different, Node 1 and Node 2 would be accessed equally fast, and thus reversing the strings would not help you out.
The main issue I see has to do with the number of customers in your system. This can lead to monotonic sharding, wherein the last shard is always the one being hit, and that can cause excessive splitting and migration. If you have a large number of customers then there is no issue; otherwise you might want to add another key on top of the customer id and date fields to more evenly divide up your content. I have heard of people using random identifiers, hashing the _id, or using a GUID to overcome this issue.
I need to load 6.6 billion bigrams into a collection but I can't find any information on the best way to do this.
Loading that many documents onto a single primary-key index would take forever, but as far as I'm aware Mongo doesn't support the equivalent of partitioning?
Would sharding help? Should I try and split the data set over many collections and build that logic into my application?
It's hard to say what the optimal bulk insert batch size is -- this partly depends on the size of the objects you're inserting and other immeasurable factors. You could try a few batch sizes and see what gives you the best performance. As an alternative, some people like using mongoimport, which is pretty fast, but your import data needs to be JSON or CSV. There's obviously mongorestore, if the data is in BSON format.
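If you end up inserting from your own code rather than via mongoimport, the "try a few batch sizes" idea looks roughly like this with the Java driver (collection name, document shape and batch size are placeholders):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.InsertManyOptions;
import org.bson.Document;
import java.util.ArrayList;
import java.util.List;

// Sketch only: batch the bigrams into insertMany calls and experiment with the batch
// size (1k/5k/10k ...) to see what your hardware likes. ordered(false) lets the server
// keep going past individual failures and is usually faster for bulk loads.
public class BulkLoadSketch {
    static final int BATCH_SIZE = 10_000;   // the knob to experiment with

    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> bigrams =
                    client.getDatabase("corpus").getCollection("bigrams");

            List<Document> batch = new ArrayList<>(BATCH_SIZE);
            for (long i = 0; i < 100_000; i++) {              // stand-in for the real data source
                batch.add(new Document("w1", "foo").append("w2", "bar").append("count", i));
                if (batch.size() == BATCH_SIZE) {
                    bigrams.insertMany(batch, new InsertManyOptions().ordered(false));
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                bigrams.insertMany(batch, new InsertManyOptions().ordered(false));
            }
        }
    }
}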
Mongo can easily handle billions of documents and can have billions of documents in the one collection but remember that the maximum document size is 16mb. There are many folk with billions of documents in MongoDB and there's lots of discussions about it on the MongoDB Google User Group. Here's a document on using a large number of collections that you may like to read, if you change your mind and want to have multiple collections instead. The more collections you have, the more indexes you will have also, which probably isn't what you want.
Here's a presentation from Craigslist on inserting billions of documents into MongoDB and the guy's blogpost.
It does look like sharding would be a good solution for you but typically sharding is used for scaling across multiple servers and a lot of folk do it because they want to scale their writes or they are unable to keep their working set (data and indexes) in RAM. It is perfectly reasonable to start off with a single server and then move to a shard or replica-set as your data grows or you need extra redundancy and resilience.
However, other users use multiple mongods to get around the locking limits of a single mongod with lots of writes. It's obvious but still worth saying that a multi-mongod setup is more complex to manage than a single server. If your IO or CPU isn't maxed out here, your working set is smaller than RAM, and your data is easy to keep balanced (pretty randomly distributed), you should see improvement (with sharding on a single server). As an FYI, there is potential for memory and IO contention. With 2.2 having improved concurrency with db locking, I suspect that there will be much less of a reason for such a deployment.
You need to plan your move to sharding properly, i.e. think carefully about choosing your shard key. If you go this way then it's best to pre-split and turn off the balancer. It will be counter-productive to be moving data around to keep things balanced which means you will need to decide up front how to split it. Additionally, it is sometimes important to design your documents with the idea that some field will be useful for sharding on, or as a primary key.
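As a rough sketch of the pre-split step (the namespace and split points are invented; you'd derive real split points from the distribution of your shard key):

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

// Sketch only: stop the balancer, then pre-split an (empty, already sharded) collection
// at chosen shard-key values before the bulk load. "corpus.bigrams", the "w1" key and
// the split points are placeholders.
public class PreSplitSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://mongos.example:27017")) {
            MongoDatabase admin = client.getDatabase("admin");

            admin.runCommand(new Document("balancerStop", 1));   // or sh.stopBalancer() in the shell

            String[] splitPoints = {"ba", "da", "fa", "ha"};     // invented split points on the key
            for (String point : splitPoints) {
                admin.runCommand(new Document("split", "corpus.bigrams")
                        .append("middle", new Document("w1", point)));
            }
            // ...optionally move the resulting chunks to specific shards with moveChunk,
            // run the bulk load, then re-enable the balancer (balancerStart / sh.startBalancer()).
        }
    }
}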
Here are some good links:
Choosing a Shard Key
Blog post on shard keys
Overview presentation on sharding
Presentation on Sharding Best Practices
You can absolutely shard data in MongoDB (which partitions it across N servers on the shard key). In fact, that's one of its core strengths. There is no need to do that in your application.
For most use cases, I would strongly recommend doing that for 6.6 billion documents. In my experience, MongoDB performs better with a number of mid-range servers rather than one large one.