MongoDB optimization aggregation

A few days ago I installed MongoDB on my computer to run some tests for work; in short, we have to transfer a huge quantity of data from a Postgres-based system to a MongoDB one.
Because we don't know MongoDB (this is the first time we've used it), we studied the documentation and ran some tests on a small DB with little data to check the performance...
After many tests we still see a significant slowdown...
Anyway, I'll explain the context, so maybe somebody can tell me whether we did something wrong.
We know which queries are the most "problematic", and I'll write one of them here. In Postgres the query is something like this (I'll cut the unnecessary parts):
selectStmt varchar = 'SELECT station.radarmeteo_id,
date(datetime_range) AS datetime_range,
district.name AS district,
city.name AS city,
min_temperature::real / 10::real,
max_temperature::real / 10::real,
rainfall_daily::real / 10::real,
max_wind_speed::real / 10::real,
extract(epoch FROM datetime_range) as unix_datetime ';
fromStmt varchar = ' FROM measurement_daily
INNER JOIN station ON measurement_daily.station_id = station.id';
In MongoDB we wrote this:
db.measurement_daily.aggregate([
    { "$match": { "min_temperature": { "$gt": random.randint(-30, 14), "$lt": random.randint(18, 50) } } },
    { "$lookup": { "from": "station", "localField": "station_id", "foreignField": "_id", "as": "scd" } },
    { "$unwind": "$scd" },
    { "$project": { "_id": 1, "min_temperature": 1, "max_temperature": 1, "rainfall_daily": 1, "max_wind_speed": 1,
                    "radarmeteo_id": "$scd.radarmeteo_id", "city_name": "$scd.city_name", "district_name": "$scd.district_name" } },
    { "$out": "result" }
])
What I am asking here is: could it be written better? Is there a better way to get the same result? Is there any other optimization we could use?
We need the best possible response time, because the real DB should hold something like 200,000,000 documents in this collection alone...
And even here, with just 2 tables/collections of 1,000 (station) and 6,400 (measurement_daily) records/documents respectively, we get 3.5-4 s (Postgres) vs 30-32 s (MongoDB) as response times...
(To test the performance on both systems the query is repeated 200 times (that's why we have 3.5-4 s and 30-32 s respectively), to get a "homogeneous" response time and minimize the influence of external factors.)
Any help is really appreciated...

According to the MongoDB documentation, when a $unwind immediately follows another $lookup, and the $unwind operates on the as field of the $lookup, the optimizer can coalesce the $unwind into the $lookup stage. This avoids creating large intermediate documents.
In your case it will look like:
"$lookup": {
"from":"station",
"localField":"station_id",
"foreignField":"_id",
"as": "scd"
unwinding: { preserveNullAndEmptyArrays: false }
}
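Note that the coalesced form above is what the optimizer produces internally and what shows up in the explain output, not syntax you write yourself. As a sketch, and assuming there is no index on min_temperature yet (my assumption, not stated in the question), you could support the initial $match and inspect the optimized pipeline like this:

// Hypothetical index to support the leading $match
db.measurement_daily.createIndex({ min_temperature: 1 })

// Ask the server for the optimized pipeline; the coalesced $lookup + $unwind
// appears as a single $lookup stage containing an "unwinding" field
db.measurement_daily.aggregate([
    { "$match": { "min_temperature": { "$gt": -10, "$lt": 30 } } },   // fixed bounds instead of random.randint
    { "$lookup": { "from": "station", "localField": "station_id", "foreignField": "_id", "as": "scd" } },
    { "$unwind": "$scd" }
], { explain: true })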

Related

MongoDB speed query when searching by part of text

I have a smallish database with 2 million records of phone calls.
When I execute:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte':ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum':"2545874578"
}).limit(100).count({})
it is super fast and takes 95 ms. Note that IsIncoming, DateCreated and CallerIdNum have indexes. Every time I search using those fields it is super fast.
The moment I search for something containing part of a text it is very slow. For example, this query now takes 25 seconds:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte': ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum' : /2545874/
}).limit(100).count({})
I know the reason is that I am searching within CallerIdNum. If I were to know the full caller ID in advance, like in my first query, then it would be fast.
Question
I would like the last query to execute faster. I know it is probably impossible, and that the only way to get great performance is to search by the whole CallerIdNum. But maybe/hopefully I am wrong and someone can help me find a way of executing my last query faster.
The problem here is that you are searching for a substring of a caller ID number with /2545874/. That is not sargable and generally cannot use an index. Assuming you really want numbers which start with that prefix, use this sargable version instead:
db.getCollection('calls').find({
'IsIncoming':true, 'DateCreated' : { '$gte': ISODate('2010-12-02T02:26:22.478Z') }, 'CallerIdNum' : /^2545874/
}).limit(100).count({})
You might also want to add a compound index on all three fields, though at least the version of the query I gave above can use an index involving the CallerIdNum field.
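A sketch of such an index; the field order is my assumption (the equality field first, then the prefix-searchable CallerIdNum, then the DateCreated range):

// Hypothetical compound index covering all three filtered fields
db.getCollection('calls').createIndex({ IsIncoming: 1, CallerIdNum: 1, DateCreated: 1 })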

Pymongo - find multiple different documents

My question is very similar to how-to-get-multiple-document-using-array-of-mongodb-id; however, I would like to find multiple documents without using the _id.
That is, consider that I have documents such as
document = { _id: _id, key_1: val_1, key_2: val_2, key_3: val_3}
I need to be able to .find() by multiple parameters, for example:
query_1 = {key_1: foo, key_2: bar}
query_2 = {key_1: foofoo, key_2: barbar}
Right now, I am running a query for query_1, followed by a query for query_2.
As it turns out, this method is extremely inefficient.
I tried to add concurrency to make it faster, but the speedup was not even 2x.
Is it possible to query multiple documents at once?
I am looking for a method that returns the union of the matches for query_1 AND query_2.
If this is not possible, do you have any suggestions that might speed up a query of this type?
Thank you for your help.
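For reference (this is not from the original post), the union of the two filters can be expressed in a single find() with $or; the same filter document works unchanged from PyMongo. The values are the placeholders from the question:

// Union of query_1 and query_2 in one round trip
db.collection.find({
    "$or": [
        { "key_1": "foo",    "key_2": "bar" },
        { "key_1": "foofoo", "key_2": "barbar" }
    ]
})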

Query one document per association from MongoDB

I'm investigating how MongoDB would work for us. One of the most-used queries fetches the latest measurement (or the latest before a given time) for each station. There are thousands of stations and each station has tens of thousands of measurements.
So we plan to have one collection for stations and another for measurements.
In SQL we would do the query with
SELECT * FROM measurements
INNER JOIN (
SELECT max(meas_time) AS meas_time, station_id
FROM measurements
WHERE meas_time <= 'time_to_query'
GROUP BY station_id
) t2 ON t2.station_id = measurements.station_id
AND t2.meas_time = measurements.meas_time
This returns one measurement for each station, and the measurement is the newest one before the 'time_to_query'.
What query should be used in MongoDB to produce the same result? We are actually using Rails and Mongoid, but that should not matter.
update:
This question is not about how to perform a JOIN in MongoDB. The fact that in SQL getting the right data out of the table requires a join doesn't necessarily mean that in MongoDB we would also need a join. There is only one table used in the query.
We came up with this query
db.measurements.aggregate([{$group:{ _id:{'station_id':"$station_id"}, time:{$max:'$meas_time'}}}]);
with indexes
db.measurements.createIndex({ station_id: 1, meas_time: -1 });
Even though it seems to give the right data, it is really slow: it takes roughly a minute to get a bit over 3,000 documents from a collection of 65 million.
I just found that MongoDB is not using the index in this query, even though we are on version 3.2.
I guess a worst-case solution would be something like this (off the top of my head):
measures = []
Station.all.each do |station|
  measurement = Measurement.where(:station_id => station.id, :meas_time.lte => 'time_to_query').order_by(meas_time: -1).first
  measures << [station.name, measurement.measure, ....]
end
It depends on how much time the query is allowed to take. The data should in any case be indexed by station_id and meas_time.
How much time does the SQL query take?
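For what it's worth, a sketch of an aggregation that returns the full newest measurement per station (not only the max time): $$ROOT keeps the whole document, the date literal is a placeholder for 'time_to_query', and allowDiskUse is only needed if the grouping exceeds the memory limit.

db.measurements.aggregate([
    { $match: { meas_time: { $lte: ISODate("2016-01-01T00:00:00Z") } } },   // placeholder for 'time_to_query'
    { $sort:  { station_id: 1, meas_time: -1 } },                           // newest first within each station
    { $group: { _id: "$station_id", latest: { $first: "$$ROOT" } } }        // keep the newest full document
], { allowDiskUse: true })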

mongodb mapreduce groupby twice

I am new to MongoDB and I am trying to count how many distinct users log in per day in an existing collection. The data in the collection looks like the following:
[{
_id: xxxxxx,
properties: {
uuid: '4b5b5c2e208811e3b5a722000a97015e',
time: ISODate("2014-12-13T00:00:00Z"),
type: 'login'
}
}]
Due to my limited knowledge, what I have figured out so far is to group by day first, output the data to a tmp collection, and then use this tmp collection to do another map-reduce and output the result to a final collection. This solution makes my collections bigger, which I don't really like. Can anyone help me out, or point me to any good/more advanced tutorials that I can follow? Thanks.
Rather than a map-reduce, I would suggest an aggregation. You can think of an aggregation pipeline somewhat like a Linux pipe, in that you pass the results of one operation to the next. With this strategy, you can perform 2 consecutive groups and never have to write anything to the database.
Take a look at this question for more details on the specifics.
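As a sketch of those two consecutive groups, based on the sample document above (the collection name events is my assumption), counting distinct login users per day could look like this:

db.events.aggregate([
    { $match: { "properties.type": "login" } },
    // 1st group: collapse to one document per (day, uuid) pair
    { $group: { _id: {
        year:  { $year:       "$properties.time" },
        month: { $month:      "$properties.time" },
        day:   { $dayOfMonth: "$properties.time" },
        uuid:  "$properties.uuid"
    } } },
    // 2nd group: count the distinct uuids left for each day
    { $group: { _id: { year: "$_id.year", month: "$_id.month", day: "$_id.day" },
                distinctUsers: { $sum: 1 } } }
])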

How to make the distinct operation faster in MongoDB

There are 30,000,000 records in one collection.
When I use the distinct command on this collection from Java, it takes about 4 minutes, and the result's count is about 40,000.
Is MongoDB's distinct operation so inefficient?
And how can I make it more efficient?
Is MongoDB's distinct operation so inefficient?
At 30m records? I would say 4 minutes is actually quite good; I think that's about as fast as, if not a little faster than, SQL would do it.
I would probably test this in other databases before calling it inefficient.
However, one way of looking at performance is to check whether the field is indexed and whether that index fits in RAM or can be loaded without page thrashing. distinct() can use an index as long as the field has one.
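For example, assuming the field you run distinct on is called user_id (the real field name is not given in the question), an index that distinct() can scan would look like this:

// Hypothetical field name; replace user_id with the field you pass to distinct()
db.collection.createIndex({ user_id: 1 })
db.collection.distinct("user_id")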
And how can I make it more efficient?
You could use a couple of methods:
Incremental map-reduce that distincts the main collection into a unique collection every, say, 5 minutes
Pre-aggregating the unique collection on save, by writing to two collections: one with the detail and one with the unique values
Those are the two most viable ways of working around this with good performance.
Edit
distinct() is not outdated, and if it fits your needs it is actually more performant than $group, since it can use an index.
The .distinct() operation is an old one, as is .group(). In general these have been superseded by .aggregate(), which should generally be used in preference to those operations:
db.collection.aggregate([
    { "$group": {
        "_id": "$field",
        "count": { "$sum": 1 }
    }}
])
Substituting "$field" with whatever field you wish to get a distinct count from. The $ prefixes the field name to assign the value.
Look at the documentation and especially $group for more information.
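If you only need the number of distinct values (the question mentions a result count of about 40,000), a second $group can count them in the same pipeline. A sketch, with "$field" still a placeholder for your real field:

db.collection.aggregate([
    { "$group": { "_id": "$field" } },                                 // one document per distinct value
    { "$group": { "_id": null, "distinctCount": { "$sum": 1 } } }      // count those documents
])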