Aggregation framework optimization - MongoDB

I have a document structure like this:
{
    id,
    companyid,
    fieldA1,
    valueA1,
    fieldA2,
    valueA2,
    ...
    fieldB15,
    valueB15,
    ...
    fieldF150,
    valueF150
}
My job is to multiply fieldA1*valueA1, fieldA2*valueA2, and so on, and sum the products into new fields: A_sum = sum(A fields * A values), B_sum = sum(B fields * B values), C_sum, etc.
In the next step I have to generate final_sum = (A_sum*A_val + B_sum*B_val + ...).
I have modeled this with the aggregation framework, using 3 projections for the three calculation steps. At this point it takes about 100 seconds for 750,000 docs; the only index is on _id, which is a GUID, and the CPU sits at about 15%.
I tried grouping in order to force parallel operations and load the CPU more, but it seems to take even longer.
What else can I do to make it faster, i.e. load more of the CPU and use more parallelism?
I don't need a $match stage, as I have to process all docs.
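For reference, here is a minimal sketch of the kind of pipeline I mean, written with pymongo and showing only the first two A and B field/value pairs (the real pipeline enumerates all of them; A_val and B_val are assumed to be existing fields on the document):

```python
from pymongo import MongoClient

# Hypothetical connection and collection names; only a few field/value pairs
# are shown, and A_val / B_val are assumed to exist on each document.
coll = MongoClient()["mydb"]["mycoll"]

pipeline = [
    # Step 1: per-group sums of field * value.
    {"$project": {
        "companyid": 1,
        "A_val": 1,
        "B_val": 1,
        "A_sum": {"$add": [
            {"$multiply": ["$fieldA1", "$valueA1"]},
            {"$multiply": ["$fieldA2", "$valueA2"]},
        ]},
        "B_sum": {"$add": [
            {"$multiply": ["$fieldB1", "$valueB1"]},
            {"$multiply": ["$fieldB2", "$valueB2"]},
        ]},
    }},
    # Step 2: combine the per-group sums into the final value.
    {"$project": {
        "companyid": 1,
        "final_sum": {"$add": [
            {"$multiply": ["$A_sum", "$A_val"]},
            {"$multiply": ["$B_sum", "$B_val"]},
        ]},
    }},
]

results = list(coll.aggregate(pipeline, allowDiskUse=True))
```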

You might get it done using sharding, as the scanning of the documents would be done in parallel.
Simply measure the time your aggregation needs now, and calculate the number of shards you need using
((t/100)+1)*s
where t is the time the aggregation currently takes in seconds and s is the number of existing shards (1 if you have a standalone or a replica set); round the result up, of course. The 1 is added to make sure that the overhead of running an aggregation in a sharded environment is offset by the additional shard.
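For example, with the numbers from the question (roughly t = 100 seconds on what is effectively a single shard, s = 1), the formula gives ((100/100) + 1) * 1 = 2 shards.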

My only solution was to split the collection into smaller collections (same total size after all) and run the computation per smaller collection from a C# console app using the parallel library, which raises CPU usage to about 70%.
That reduces the time from approx. 395 s at 15% CPU (script via Robomongo, all docs) to 25-28 s at 65-70% CPU (C# console app with parallelism).
Using grouping did not help in my case.
Sharding is not an option for now.
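In Python, the same split-and-process-in-parallel idea would look roughly like this (multiprocessing plus pymongo; the collection, partition, and field names are placeholders, and only A_sum is computed):

```python
from multiprocessing import Pool
from pymongo import MongoClient

# Placeholder names throughout; this mirrors the console-app approach of
# splitting the data and running the per-document computation in parallel.
PARTITIONS = 8

def process_partition(partition_index):
    coll = MongoClient()["mydb"]["mycoll"]      # one client per worker process
    query = {"partition": partition_index}      # assumes a precomputed partition field
    for doc in coll.find(query):
        a_sum = sum(doc[f"fieldA{i}"] * doc[f"valueA{i}"] for i in (1, 2))
        # ... compute B_sum, C_sum, final_sum the same way ...
        coll.update_one({"_id": doc["_id"]}, {"$set": {"A_sum": a_sum}})

if __name__ == "__main__":
    with Pool(PARTITIONS) as pool:
        pool.map(process_partition, range(PARTITIONS))
```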


Aggregate $lookup so slow

1st stage:
$match
{
  delFlg: false
}
2nd stage:
$lookup
{
  from: "parkings",
  localField: "parking_id",
  foreignField: "_id",
  as: "parking"
}
We are using 2 indexes (_id and parking_id).
On our production MongoDB server (Linux), the explain plan in Compass shows almost 9000 ms.
On a local MongoDB server the explain plan shows about 1400 ms.
Is there a problem somewhere? Why is our production server running so slow?
Full explain plan (as displayed in Compass; the sub-documents are collapsed):
Query Performance Summary
  Documents returned: 71184
  Actual query execution time (ms): 8763
  Query used the following indexes: 2 (_id_, delFlg_1)
explainVersion: "1"
stages: Array
serverInfo: Object
serverParameters: Object
command: Object
ok: 1
$clusterTime: Object
operationTime: Timestamp({ t: 1666074389, i: 3 })
Production MongoDB version: 5.0.5
Local MongoDB version: 6.0.2
Unfortunately the explain output that was provided is mostly truncated (eg Array and Object hiding most of the important information). Even without that there should be some meaningful observations that we can make.
As always, I suggest starting by more clearly defining what your ultimate goals are. What is considered "so slow", and how quickly would the operation need to return in order to be acceptable for your use case? As outlined below, I doubt there is much room for improvement with the current schema. You may need to rethink the approach entirely if you are looking for improvements that are several orders of magnitude in nature.
Let's consider what this operation is doing:
The database goes to an entry in the { delFlg: 1 } index that has a value of false.
It then goes to FETCH that full document from the collection.
Based on the value of the parking_id field, it performs an IXSCAN of the { _id: 1 } index on the parkings collection. If a matching index entry is discovered then the database will FETCH the full document from the other collection and add it to the new parking array field that is being generated.
This full process then repeats 71,183 more times.
Using 8763ms as the duration, that means that the database performed each full iteration of the work described above more than 8 times per millisecond (8.12 ~= 71184 (iters) / 8763 (ms)). I think that sounds pretty reasonable and is unlikely to be something that you can meaningfully improve on. Scaling up the hardware that the database is running on may provide some incremental improvements, but it is generally costly, not scalable, and likely not worthwhile if you are looking for more substantial improvements.
You also mentioned two different timings and database versions in the question. If I understood correctly, it was mentioned that the (explain) operation takes ~9 seconds on version 5.0 and about 1.4 seconds on version 6.0. While there is far too little information here to say for sure, one reason for that may be associated with improvements for $lookup that were introduced in version 6.0. Indeed from their 7 Big Reasons to Upgrade to MongoDB 6.0 article, they claim the following:
The performance of $lookup has also been upgraded. For instance, if there is an index on the foreign key and a small number of documents have been matched, $lookup can get results between 5 and 10 times faster than before.
This seems to match your situation and observations pretty directly.
Apart from upgrading, what else might improve performance here? Well, if many/most documents in the source collection have { delFlg: false } then you may wish to get rid of the { delFlg: 1 } index. Indexes provide benefits when the size of the results they retrieve is small relative to the overall size of the data in the collection. But as that percentage grows, the overhead of scanning most of the index plus randomly fetching all of the data quickly becomes less effective than just scanning the full collection directly. It is mentioned in the comments that the invoices collection contains 70k documents, so this predicate on delFlg hardly seems to remove any results at all.
One other thing that really stands out is this statement from the comments: "parkings contains 16 documents". Should the information from those 16 documents just be moved into the invoices documents directly instead? If the two are commonly accessed together and the parkings information doesn't change very often, then combining the two would remove a lot of overhead and would further improve performance.
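If you did go that route, a one-off migration along these lines would do it (pymongo sketch; the collection and field names are taken from the question, but treat it as an illustration rather than a drop-in script):

```python
from pymongo import MongoClient

# Collection/field names follow the question (invoices, parkings, parking_id);
# this simply embeds each parking document into the invoices that reference it.
db = MongoClient()["mydb"]

for parking in db.parkings.find():
    db.invoices.update_many(
        {"parking_id": parking["_id"]},
        {"$set": {"parking": parking}},   # after this, the $lookup stage can be dropped
    )
```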
Two additional components about explain that are worth keeping in mind:
The command does not factor in network time to actually transmit the results to the client. Depending on the size of the documents and the physical topology of the environment, this network time could add another meaningful delay to the operation. Be sure that the client really needs all of the data that is being sent.
Depending on the verbosity, explain will measure the time it takes to run the operation through completion. Since there are no blocking stages in your aggregation pipeline, we would expect your initial batch of documents to be returned to the client in much less time. Apart from networking, that time may be around 12ms (which is approximately (101 docs / 71184 ) * 8763) for 5.0.

Fetching millions of records from Cassandra using Spark in Scala: performance issue

I have tried a single-node cluster and a 3-node cluster on my local machine to fetch 2.5 million entries from Cassandra using Spark, but in both scenarios it takes 30 seconds just for a SELECT COUNT(*) from the table. I need this count, and similar other counts, for real-time analytics.
SparkSession.builder().getOrCreate().sql("SELECT COUNT(*) FROM data").show()
Cassandra isn't designed to iterate over the entire data set in a single expensive query like this. If there are, for example, 10 petabytes of data, this query would require reading 10 petabytes off disk, bringing it into memory, and streaming it to the coordinator, which has to resolve the tombstones/deduplication (you can't just have each replica send a count or you will massively under/over count it) while incrementing a counter. This is not going to work within a 5-second timeout. You can use aggregation functions over smaller chunks of the data, but not in a single query.
If you really want to make this work, query the system.size_estimates table of each node, and split each range according to its size so that you get an approximate maximum of, say, 5k rows per read. Then issue a COUNT(*) with a token() restriction for each of the split ranges and combine the values of all those queries. This is how the Spark connector does its full table scans for the SELECT * RDDs, so you just have to replicate that.
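A rough sketch of that approach with the Python driver (keyspace, table, and partition-key names are placeholders; a real implementation would query system.size_estimates on every node and split large ranges further):

```python
from cassandra.cluster import Cluster

# Placeholder keyspace/table/partition-key names; simplified to a single node
# and to the ranges exactly as reported by system.size_estimates.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

ranges = session.execute(
    "SELECT range_start, range_end FROM system.size_estimates "
    "WHERE keyspace_name = 'my_ks' AND table_name = 'data'"
)

total = 0
for r in ranges:
    # Each COUNT(*) is restricted to one small token range, so no single
    # query has to scan the whole table.
    row = session.execute(
        "SELECT COUNT(*) FROM my_ks.data "
        f"WHERE token(pk) > {int(r.range_start)} AND token(pk) <= {int(r.range_end)}"
    ).one()
    total += row.count

print(total)
```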
The easiest, and probably safer and more accurate (but less efficient), option is to use Spark to just read the entire data set and then count, not using an aggregation function.
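With the Spark Cassandra connector that would look roughly like this (PySpark sketch; keyspace and table names are placeholders, and the connector package has to be available to the session):

```python
from pyspark.sql import SparkSession

# Placeholder keyspace/table names; assumes the spark-cassandra-connector
# package is on the classpath of the session.
spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_ks", table="data")
      .load())

# Count on the Spark side after scanning the token ranges in parallel,
# instead of issuing a single CQL COUNT(*).
print(df.count())
```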
How long does it take to run this query directly without Spark? I think that it is not possible to parallelize such COUNT queries, so you won't benefit from using Spark for performing them.

Timeseries storage in MongoDB

I have about 1000 sensors outputting data during the day. Each sensor outputs about 100,000 points per day. When I query the data I am only interested in getting data from a given sensor on a given day; I don't do any cross-sensor queries. The time series are unevenly spaced and I need to keep the time resolution, so I cannot do things like arrays of 1 point per second.
I plan to store data over many years. I wonder which scheme is the best:
each day/sensor pair corresponds to one collection, thus adding 1000 collections of about 100,000 documents each per day to my db
each sensor corresponds to a collection. I have a fixed number of 1000 collections that grow every day by about 100,000 documents each.
1 intuitively seems faster for querying. I am using MongoDB 3.4, which has no limit on the number of collections in a db.
2 seems cleaner, but I am afraid the collections will become huge and that querying will gradually get slower as each collection grows.
I am favoring 1 but I might be wrong. Any advice?
Update:
I followed the advice of
https://bluxte.net/musings/2015/01/21/efficient-storage-non-periodic-time-series-mongodb/
Instead of storing one document per measurement, I now have documents containing 128 measurements plus startDate and nextDate fields (see the sketch below). This reduces the number of documents and thus the index size, but I am still not sure how to organize the collections.
When I query data, I just want the data for a (date, sensor) pair, which is why I thought option 1 might speed up reads. I currently have about 20,000 collections in my DB, and even listing all collections takes ages, which makes me think that having so many collections is not a good idea.
What do you think?
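For illustration, the bucket documents I ended up with look roughly like this (pymongo sketch; collection and field names are slightly simplified):

```python
from datetime import datetime
from pymongo import MongoClient

# Simplified collection/field names; each bucket holds up to 128 raw
# (timestamp, value) samples for one sensor, following the linked article.
coll = MongoClient()["timeseries"]["readings"]

bucket = {
    "sensor": "sensor-0042",
    "startDate": datetime(2017, 3, 14, 8, 0, 0),
    "nextDate": datetime(2017, 3, 14, 8, 2, 8),   # first timestamp after this bucket
    "samples": [
        {"t": datetime(2017, 3, 14, 8, 0, 0, 125000), "v": 21.4},
        {"t": datetime(2017, 3, 14, 8, 0, 1, 980000), "v": 21.6},
        # ... up to 128 samples per document ...
    ],
}
coll.insert_one(bucket)

# A (date, sensor) query then only touches the buckets overlapping that day.
day_start, day_end = datetime(2017, 3, 14), datetime(2017, 3, 15)
docs = coll.find({"sensor": "sensor-0042",
                  "startDate": {"$lt": day_end},
                  "nextDate": {"$gt": day_start}})
```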
I would definitely recommend approach 2, for a number of reasons:
MongoDB's sharding is designed to cope with individual collections getting larger and larger, and copes well with splitting data within a collection across separate servers as required. It does not have the same ability to split data that lives in many collections across different servers.
MongoDB is designed to efficiently query very large collections, even when the data is split across multiple servers, as long as you can pick a suitable shard key that matches your most common read queries. In your case, that would be sensor + date (see the sketch after this list).
With approach 1, your application needs to do the fiddly job of knowing which collection to query, and (possibly) where that collection is to be found. Approach 2, with well-configured sharding, means that the mongos process does that hard work for you.
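A minimal sketch of what that setup might look like (pymongo admin commands; database, collection, and field names are placeholders, and a sharded cluster with a mongos is assumed to already exist):

```python
from pymongo import MongoClient

# Placeholder database/collection/field names; run against the mongos of an
# existing sharded cluster.
client = MongoClient("mongodb://mongos-host:27017")

client.admin.command("enableSharding", "timeseries")
client.admin.command(
    "shardCollection",
    "timeseries.readings",
    key={"sensor": 1, "startDate": 1},   # shard key matching the (sensor, date) read pattern
)
```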
Whilst MongoDB has no limit on collections, I tried an approach similar to 2 but moved away from it to a single collection for all sensor values, because it was more manageable.
Your planned data volume is significant. Have you considered ways to reduce it? In my system I compress same-value runs and only store changes; I can also reduce the volume by skipping co-linear midpoints and interpolating later when, say, I want to know what the value was at time t. Different sensors may need different compression algorithms (e.g. a stepped sensor like a thermostat set-point vs. one that represents a continuous quantity like a temperature). Having a single large collection also makes it easy to discard data when it does get too large.
If you can guarantee unique timestamps you may also be able to use the timestamp as the _id field.
When I query the data I'm only interested in getting data from a
given sensor on a given day. I don't do any cross sensor queries.
But that's exactly what Cassandra is good for!
See this article and this one.
Really, in one of my projects we were stuck with a legacy MongoDB setup in a scenario similar to yours, except that the amount of new data per day was even lower.
We tried changing the data structure, granulating the data over multiple MongoDB collections, changing replica set configurations, etc.
But we were still disappointed: as the data grew, performance degraded
under the unpredictable load, and read requests noticeably affected write response times.
With Cassandra we had fast writes, and the improvement in data retrieval performance was visible to the naked eye. If you need complex data analysis and aggregation, you can always use Spark (map-reduce) jobs.
Moreover, thinking about the future, Cassandra provides straightforward scalability.
I believe that keeping a legacy technology is fine as long as it suits you well, but if it does not, it's more effective to change the technology stack.
If I understand right, you plan to create collections on the fly, i.e. at 12 AM you will have new collections. I guess MongoDB is the wrong choice for this: there is no straightforward way to query documents across collections in MongoDB, so you would have to write a complex mechanism to retrieve the data. In my opinion, you should consider Elasticsearch, where you can create indices (collections) like sensor-data-s1-3-14-2017 and do a wildcard search across indices (e.g. sensor-data-s1* or sensor-data-*). See here for wildcard search.
If you want to go with MongoDB, my suggestion is to go with option 2 and shard the collections. While sharding, consider your query pattern so you get optimal performance that does not degrade over time.
Approach #1 does not scale; the key to speeding things up is to divide (shard) and rule. What if the number of signals itself reaches 100,000?
So place one signal in one collection and shard the signals over nodes to speed up reads. Multiple collections or signals can live on the same node.
How this will help:
Signal processing is usually done over a time span, e.g. processing a signal for 3 days; in that case you can read the signal from 3 nodes in parallel and do the Apache Spark processing in parallel as well.
Cross-signal processing: most signal processing algorithms use the same period of 2 or more signals for analysis (e.g. cross-correlation), and as these 2 or more signals are fetched in parallel that will also be fast, and the pre-processing of individual signals can be parallelized.

MongoDB: what is faster: single find() query or many find_one()?

I have the following problem connected to the MongoDB database design. Here is my situation:
I have a collection with about 50k documents (15 kB each),
every document has a dictionary storing data samples,
my query always gets all the data from the document,
every query uses an index,
the collection has only one index (based on a single datetime field),
in most cases, I need to get data from many documents (typically 25 < N < 100),
it is easier for me to perform many SELECT queries than a single one,
I have a lot of updates in my database, though far fewer than SELECTs,
I use the WiredTiger engine (the newest version of MongoDB),
the server instance and the web application are on the same machine.
I have two possibilities for making a SELECT query:
perform a single query retrieving all the documents I am interested in,
perform N queries, each fetching a single document, where typically 25 < N < 100 (and what about different scenarios such as 100 < N < 1k or 1k < N < 10k?).
So the question is whether there is any additional overhead when I perform many small queries instead of a single one. In relational databases making many queries is very bad practice - but what about NoSQL? I am asking about general practice - should I avoid that many queries?
In the documentation, I read that what matters is not the number of queries but the number of documents searched - is that true?
Thanks for help ;)
There is a similar question to the one you asked: Is it ok to query mongodb multiple times
IMO, for your use case, i.e. 25 < N < 100, you should definitely go with batching.
In the case of single queries:
looping in a single thread will not suffice, so you'll have to make parallel requests, which creates additional overhead,
every request incurs TCP/IP overhead,
there is a certain amount of setup and teardown for each query, creating and exhausting cursors, which adds unnecessary overhead.
As explained in the answer above, there appears to be a sweet spot for how many values to batch up vs. the number of round trips, and that depends on your document type as well.
In broader terms, anything with 10 < N < 1000 should go with batching (and the remaining records should form further batches), but querying a single document at a time definitely creates unnecessary overhead.
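A minimal illustration of the two options with pymongo (hypothetical collection name and datetime values; the single indexed datetime field from the question is called ts here):

```python
from datetime import datetime, timedelta
from pymongo import MongoClient

# Hypothetical collection and timestamps; "ts" stands for the single indexed
# datetime field from the question, and 50 documents are requested.
coll = MongoClient()["mydb"]["samples"]
timestamps = [datetime(2016, 1, 1) + timedelta(hours=i) for i in range(50)]

# Option 1: N round trips, one document per query.
docs_one_by_one = [coll.find_one({"ts": ts}) for ts in timestamps]

# Option 2: a single batched query; the driver pulls the results back in a
# few cursor batches rather than one round trip per document.
docs_batched = list(coll.find({"ts": {"$in": timestamps}}))
```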
The problem when you perform many small queries instead of one query is network overhead, i.e. the network round-trip latency.
For a single request in a batch process it may not be much, but if you make many requests like these, or use this technique from a frontend, it will hurt performance.
Also, you may need to process the data yourself, e.g. sorting or aggregating it manually.

Returning aggregations from HBase data

I have an HBase table of about 150k rows, each containing 3700 columns.
I need to select multiple rows at a time, and aggregate the results back, something like:
row[1][column1] + row[2][column1] ... + row[n][column1]
row[1][column2] + row[2][column2] ... + row[n][column2]
...
row[1][columnn] + row[2][columnn] ... + row[n][columnn]
I can do this using a scanner, but the issue, I believe, is that the scanner acts like a cursor: it does not do the work distributed over multiple machines at the same time, but rather gets data from one region, then hops to another region to get the next set of data, and so on when my results span multiple regions.
Is there a way to scan in a distributed fashion (an option, or creating multiple scanners, one for each region's worth of data [this might be a can of worms in itself]), or is this something that must be done in a map/reduce job? If it's an M/R job, will it be "fast" enough for real-time queries? If not, are there good alternatives for doing these types of aggregations in real time with a NoSQL-type database?
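For context, a minimal sketch of the scanner-based version (happybase client; the table and column-family names are hypothetical), which runs entirely through one client-side scanner:

```python
import happybase

# Hypothetical table and column-family names; sums every column over a range
# of rows by pulling them through a single client-side scanner.
conn = happybase.Connection("hbase-host")
table = conn.table("mytable")

sums = {}
for row_key, data in table.scan(row_start=b"row-000100", row_stop=b"row-000200"):
    for column, value in data.items():        # column is e.g. b"cf:column1"
        sums[column] = sums.get(column, 0) + int(value)

print(sums)
```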
What I would do in such cases is keep another table with the aggregation summaries. That is, when row[m] is inserted into table 1, I would update table 2, against (column 1) (which is the row key of table 2), with the running summation or other aggregate results, be it average, standard deviation, max, min, etc.
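A sketch of that pattern with the happybase client (table, column-family, and key names are made up; it assumes integer cell values so that HBase's atomic counters can hold the running sums):

```python
import happybase

# Hypothetical table/column names; assumes integer values so that HBase
# atomic counters can maintain the per-column running sums.
conn = happybase.Connection("hbase-host")
data_table = conn.table("table1")
summary_table = conn.table("table1_sums")

def insert_row(row_key, values):
    """Insert a row into the data table and fold it into the summary table."""
    data_table.put(row_key, {f"cf:{col}".encode(): str(val).encode()
                             for col, val in values.items()})
    for col, val in values.items():
        # The summary table is keyed by column name, as described above.
        summary_table.counter_inc(col.encode(), b"cf:sum", int(val))

insert_row(b"row-000151", {"column1": 42, "column2": 7})
```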
Another approach would be to index the data into a search tool such as Lucene, Solr, or Elasticsearch and run the aggregation queries there. Here are some examples in Solr.
Finally, a scan spanning multiple regions, or an M/R job, is not designed for real-time queries (unless the cluster is designed for it, i.e. oversized relative to the data requirements).
Hope it helps.