Is there any cost involved in Algolia when i add new Indices replica for my sorting requirements? - algolia

I have about 5 attributes to be sorted in Algolia and to make them sortable i found i need to add Algolia Replica Indices.
I doubt, that will duplicate data 5 times and hence will suddenly increase my bill to 5 times.

Yes, the records in the replica indices will count for the billing, like any other index.
Unfortunately there is no workaround to prevent additional cost if you need to have sorting replicas.

Related

Timeseries storage in Mongodb

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day. I don t do any cross sensor queries. The timeseries are unevenly spaced and I need to keep the time resolution so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
1 seems to intuitively be faster for querying. I am using mongoDb 3.4 which has no limit for the number of collections in a db.
2 seems cleaner but I am afraid the collections will become huge and that querying will gradually become slower as each collection grows
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I have a document containing 128 measurement,startDate,nextDate. It reduces the number of documents and thus the index size but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date,sensor) pair, that is why I thought 1 might speed up the reads. I currently have about 20,000 collections in my DB and when I query the list of all collections, it takes ages which makes me think that it is not a good idea to have so many collections.
What do you think?
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data which exists in many collection across different servers.
MongoDB is designed to be able to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key which matches your most common read queries. In your case, that would be sensor + date.
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you
Whilst MongoDB has no limit on collections I tried a similar approach to 2 but moved away from it to a single collection for all sensor values because it was more manageable.
Your planned data collection is significant. Have you considered ways to reduce the volume? In my system I compress same-value runs and only store changes, I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time 't'. Various different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I m only interested in getting data from a
given sensor on a given day. I don t do any cross sensor queries.
But that's what exactly what Cassandra is good for!
See this article and this one.
Really, in one of our my projects we were stuck with legacy MongoDB and the scenario, similar to yours, with the except of new data amount per day was even lower.
We tried to change data structure, granulate data over multiple MongoDB collections, changed replica set configurations, etc.
But we were still disappointed as data increases, but performance degrades
with the unpredictable load and reading data request affects writing response much.
With Cassandra we had fast writes and data retrieving performance effect was visible with the naked eye. If you need complex data analysis and aggregation, you could always use Spark (Map-reduce) job.
Moreover, thinking about future, Cassandra provides straightforward scalability.
I believe that keeping something for legacy is good as long as it suits well, but if not, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I guess MongoDB is a wrong choice for this. If required in MongoDB there is no way you can query documents across collections, you will have to write complex mechanism to retrieve data. In my opinion, you should consider elasticsearch. Where you can create indices(Collections) like sensor-data-s1-3-14-2017. Here you could do a wildcard search across indices. (for eg: sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you could get optimal performance and that does not degrade over the period.
Approach #1 is not cool, key to speed up is divide (shard) and rule. What-if number of singal itself reaches 100000.
So place one signal in one collection and shard signals over nodes to speed up read. Multiple collections or signals can be on same node.
How this Will Assist
Usually for signal processing time-span is used like process signal for 3 days, in that case you can parallel read 3 nodes for the signal and do parallel apache spark processing.
Cross-Signal processing: typically most of signal processing algorithms uses same period for 2 or more signals for analysis like cross correlation and as these (2 or more signals) are parallel fetch it'll also be fast and ore-processing of individual signal can be parallelized.

Ensuring evenly distributed documents per shard in solr

I've found myself needing to support result grouping with an accurate ngroups count. This required colocation of documents by a secondaryId field.
I'm currently indexing documents using the compositeId router in solr. The uniqueKey is documentId and I'm adding a shard key at the front like this:
doc.addField("documentId", secondaryId + "!" + actualDocId);
The problem I'm seeing is that the document count accross my 3 shards is now uneven:
shard1: ~30k
shard1: ~60k
shard1: ~30k
(This is expected to grow a lot.)
Apparently the hashes of secondaryId are not very evenly distributed, but I don't know enough about possible values.
Any thoughts on getting a better distribution of these documents?
Your data is not evenly spread across you secondaryIds. Some secondary ids have a lot more data than others. There is no perfect and/or simple solution.
Assuming you cannot change your routing id, one approach is to create a larger number of shards, say 16 on same number of hosts. Your shards will now be smaller and still potentially uneven. But given their larger numbers, you can then move your shards around across the nodes you have, to more or less balance out the nodes in size.
The caveat is that you have routed queries so that each query hits only one shard. If you have unrouted queries, having a large number of shards can result in significant performance degradation as each query will need to be run against each shard.
What I've done is read the Solr routing code to see how it hashes. Then replicate some of the logic manually to figure out the hash ranges to split.
I found these online tools to convert the Ids to hash then back and forth to Hex which is what the shard split command wants.
Murmur hash app: http://murmurhash.shorelabs.com/
Use “MurmurHash3” form.
Hex converter app: https://www.rapidtables.com/convert/number/decimal-to-hex.html
I think want “Hex signed 2's complement” when different, but not when has 00000000 prefix...
You'll also have to pay attention to masking. It's somethinglike:
Imagine you have a document hashed to a HEX values of 12345678. This is a composite of:
primaryRouteId: 12xxxxxx
secondaryRouteId:xx34xxx
documentId: xxxx5678
(Note if you only have a primaryRouteId!docId then primaryRouteId takes the first 4 spots.)
You can use Solr rebalancing with the feature called UTILIZENODE.
Check these links :
https://solr.apache.org/guide/8_4/cluster-node-management.html#utilizenode
https://solr.pl/en/2018/01/02/solr-7-2-rebalancing-replicas-using-utilizenode/
It will automatically handle the uneven shards and will balance them across all the servers.
Note : It is a new feature and will work only with Solr version greater than equal to 8.2

The most performant way to query against three collections, sort them and limit?

I am in need of querying against 3 different collections.
Then I need to sort the collection results (each based on different field and order but every time a DateTime value), and then finally limit the number of results I want (10 in my case).
Currently I'm just doing three separate queries and limiting each by 10, then manually sorting based on the date times they have. Then I finally limit to 10 myself.
Is there a better way to do this?
As mongodb is no relational database where you can join multiple tables within one query, no. I'm not even sure if you could do such kind of sorting (taking each field equal on the order precedence) for relational DBMS.
What you're doing already sounds really good. You could possibly improve your sorting of these 3 results. Aborting early to iterate over one or more collections if no further element can be within the overal top 10. You could modify your queries accordingly to only return documents for the other two collections, whose date is lower than the last one (10th) of the first queried collection. But maybe you did this already...
While talking about performance you may consider to add indexes on these datetime fields used for your query to keep the fields presorted in memory.

MongoDB shard key

I've been thinking about selecting the best shard key (through a compound index) for my data and thought the combination of the document creation date combined with a customer no. (or invoice no.) would be a good combination. IF MongoDB would consider the customer no as a string backwards ie.:
90043 => 34009
90044 => 44009
90045 => 54009
etc.
Index on the The creation date would ensure that relatively new data are kept in memory and the backward customer no would help MongoDB to distribute the data/load across the cluster.
Is this a correct assumption? and if so... would I need to save my customer no reversed for it to be distributed the way I expect?
Regarding your specific question of "would I need to save my customer no reversed for it to be distributed the way I expect?", no - you would not.
Even with the relatively narrow spread of customer number values you listed, if you use customerNumber in your compound key, MongoDB will break apart the data into chunks and distribute these accordingly. As long as the data associated with customerNumber are relatively evenly distributed (e.g., one user doesn't dominate the system), you will get the shard balancing you desire.
I would consider either your original choice (minus the string reversal) or Dan's choice (using the built-in ObjectId instead of timestamp) as good candidates for your compound key.
from what I have read in the documentation the MongoId is already time based.
Therfore you can add the _id to your compound key like this: (_id, customerid). If you don't need the date in your application, you can just drop the field which would save you some storage.
MongoDB stores the datasets recently used in memory.
The index of a collection will always tried to be stored into RAM.
When an index is too large to fit into RAM, MongoDB must read the
index from disk, which is a much slower operation than reading from
RAM. Keep in mind an index fits into RAM when your server has RAM
available for the index combined with the rest of the working set.
Hope this helps.
Cheers dan
I think the issue with your thinking it that, somehow, you feel Node 1 would be faster than Node 2. Unless the hardware is drastically different then Node 1 and Node 2 would be accessed equally fast and thus reversing the strings would not help you out.
The main issue I see has to do with the number of customers in your system. This can lead to monotonic sharding wherein the last shard is the one always being hit and that can cause excessive splitting and migration. If you have a large number of customers then there is no issue, otherwise you might want to add another key on top of the customer id and date fields to more evenly divide up your content. I have heard of people using random identifiers, hashing the _id or using a GUID to overcome this issue.

Selecting top N records per group in DynamoDB

Is NoSQL in general, and DynamoDB in particular, well suited to performing greatest-n-per-group type queries, as compared to MySQL?
DynamoDB support only 2 index and can only be queried efficiently on these.
hash key
range key (optional)
Using DynamoDB to find the biggest values in a random "row" is not a good idea at all. Querying on a random row implies scanning the whole dataset which will cost you a lot of money.
Nonetheless, if your data is properly modeled, query method may be used to find the biggest range_key for a given hash_key
Here is how to proceed:
Set the has_key
Set no filter for the range_key
limit the result count to 1
scan the index backward