MapReduce in MongoDB doesn't reduce all the k-v pairs with the same key in one go

I have imported a db from a csv which has info about:
country
region
commodity
price
date
(This is the csv: https://www.kaggle.com/jboysen/global-food-prices)
The rows in the csv are ordered in this way:
country 1, region 1.1, commodity X, price, dateA
country 1, region 1.1, commodity X, price, dateB
country 1, region 1.1, commodity Y, price, dateA
country 1, region 1.1, commodity Y, price, dateB
...
country 1, region 1.2, commodity X, price, dateA
country 1, region 1.2, commodity X, price, dateB
country 1, region 1.2, commodity Y, price, dateA
country 1, region 1.2, commodity Y, price, dateB
...
country 2, region 2.1, commodity X, price, dateA
...
I need to show, for each country and each commodity, the highest price.
I wrote:
1) a map with key country+commodity and value price
var map = function() {
    emit({country: this.country_name, commodity: this.commodity_name}, {price: this.price});
};
2) a reduce that scans the prices related to a key and checks which is the highest price
var reduce = function(key, values) {
    var maxPrice = 0.0;
    values.forEach(function(doc) {
        var thisPrice = parseFloat(doc.price);
        if (typeof doc.price != "undefined") {
            if (thisPrice > maxPrice) {
                maxPrice = thisPrice;
            }
        }
    });
    return {max_price: maxPrice};
};
3) I send the output of a map reduce to a collection "mr"
db.prices.mapReduce(map, reduce, {out: "mr"});
PROBLEM:
For example, if I open the csv and manually order by:
country (increasing order)
commodity (increasing order)
price (decreasing order)
I can check that (to give an example of data) in Afghanistan the highest price for the commodity Bread is 65.25
When I check the map-reduce output, though, the max price of Bread in Afghanistan comes out as 0.
WHAT HAPPENS:
There are 10 regions in the csv in which Bread is logged for Afghanistan.
I've added, on the last line of the reduce:
print("reduce with key: " + key.country + ", " + key.commodity + "; max price: " + maxPrice);
Theoretically, if I search the mongodb log, I should find only ONE entry with "reduce with key: Afghanistan, Bread; max price: ???".
Instead I see TEN lines (same numbers of the regions), each one with a different max price.
The last one has "max price 0".
MY HYPOTHESIS:
It seems that, after the emit, when the reduce is called, instead of looking at ALL k-v pairs with the same key, it considers sub-groups that are in proximity.
So, recalling my starting example on the csv structure:
until the reduce scans emit outputs related to "afghanistan, region 1, bread", it does a reduce on them
then it does a reduce on the outputs related to "afghanistan, region 1, commodity X"
then it does another reduce on the outputs related to "afghanistan, region 2, bread" (instead of reducing ALL the k-v pairs with afghanistan+bread in a single reduce)
Do I have to do a re-reduce to work on all the partial reduce jobs?

I've managed to solve this.
MongoDB doesn't necessarily do the reducing of all k-v pairs with the same key in one go.
It can happen (as in this case) that MongoDB performs a reduce on a subset of k-v pairs related to a specific key, and then feeds the output of this first reduce into a second reduce on another subset related to the same key.
My code didn't work because:
MongoDB performed a reduce on a subset of k-v pairs related to the key "Afghanistan, Bread", producing an output field named "max_price"
MongoDB would proceed to reduce other subsets
MongoDB, when faced with another subset of "Afghanistan, Bread", would take the output of the first reduce and use it as one of the values
The output of a reduce has a field named "max_price", but the emitted values have a field named "price"
Since I read the value "doc.price", when I scan the doc that contains "max_price", it gets ignored
There are 2 approaches to solve this:
1) Use the same field name in the reduce output as in the emitted value
2) Index the properties chosen as key and use the "sort" option of mapReduce() so that all k-v pairs related to a key get reduced in one go
The second approach is useful if you don't want to give up using a different name for the reduce output (plus it performs better, since it only does a single reduce per key). A sketch of both approaches follows.
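Here is a minimal sketch of approach 1, keeping the field names country_name, commodity_name and price from the question. The changes are that the map emits the value under the same max_price name the reduce returns, and the parseFloat conversion moves into the map, so partial reduce outputs stay comparable:

var map = function() {
    // emit the value under the same field name the reduce will return
    emit(
        {country: this.country_name, commodity: this.commodity_name},
        {max_price: parseFloat(this.price)}
    );
};

var reduce = function(key, values) {
    var maxPrice = 0.0;
    values.forEach(function(doc) {
        // works both for freshly emitted values and for outputs of earlier reduces
        if (typeof doc.max_price != "undefined" && doc.max_price > maxPrice) {
            maxPrice = doc.max_price;
        }
    });
    return {max_price: maxPrice};
};

db.prices.mapReduce(map, reduce, {out: "mr"});

For approach 2, the idea would be to index the key fields and pass them to the sort option, e.g. db.prices.createIndex({country_name: 1, commodity_name: 1}) followed by db.prices.mapReduce(map, reduce, {out: "mr", sort: {country_name: 1, commodity_name: 1}}).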

Related

Mongo Db ESR rule for multiple index keys in a compound index

Collection: appointments
Schema:
{
    _id: ObjectId();
    userId: string;
    calType: string;
    status: string;
    appointment_start_date_time: string; // UTC ISO string
    appointment_end_date_time: string;   // UTC ISO string
}
Example:
{
    _id: ObjectId('6332b21960f8083d24f3140b'),
    userId: "6272ccb3-4050-429c-b427-eb104f340962",
    calType: "MY Personal Cal",
    status: "CONFIRMED",
    appointment_start_date_time: "2022-07-08T03:30:00.000Z",
    appointment_end_date_time: "2022-07-08T04:00:00.000Z"
}
I want to create a compound index on userId, calType, status, appointment_start_date_time
Based on Mongo Db's ESR rule I would like to determine the arrangement of my keys.
The documentation conveniently gives an example of a compound index with 3 keys, where the first key is for equality, the second for sort and the third for range. But in my case I have more than 3 keys.
I would like to know how the index keys should be arranged for a more efficient compound index. In my case userId, calType, status will be used for equality-based matching, whereas appointment_start_date_time will be used for sorting.
Potential queries which I will be making on this collection will be:
All appointments where userId = x, calType = y, status = z sort by appointment_start_date_time ASC
All appointments where userId = x, status = z
All appointments where calType = y, status = z
All appointments where userId = x sort by appointment_start_date_time ASC or DSC
What is the standard when we have multiple keys for equality and one for sorting/range?
None of your sample queries use a ranged filter. Assuming none of these fields contain arrays, applying the ESR rule:
Queries 1 and 2 could be optimally served by an index on
{userId:1, status:1, calType:1, appointment_start_date_time:1}
Query 3 would be best served by this index:
{calType:1, status:1}
Query 4 would be best served by:
{userId:1, appointment_start_date_time:1}
In these optimal cases, the MongoDB server could seek to the first matching index key, scan to the last key in a single pass, and encounter the documents in already sorted order.
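As a sketch, those three indexes could be created on the appointments collection from the question like this:

db.appointments.createIndex({ userId: 1, status: 1, calType: 1, appointment_start_date_time: 1 })  // queries 1 and 2
db.appointments.createIndex({ calType: 1, status: 1 })                                             // query 3
db.appointments.createIndex({ userId: 1, appointment_start_date_time: 1 })                         // query 4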
It may also be possible to get acceptable performance for queries 1,2, and 4 using the index:
{userId:1, appointment_start_date_time:1, status:1, calType:1}
Using this index, query 4 would still be optimal, but queries 1 and 2 would require an additional index seek for each status/calType pair. This would be somewhat less performant than the optimal case, but would still be better than an in-memory sort.
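A sketch of that compromise index, again on the appointments collection:

db.appointments.createIndex({ userId: 1, appointment_start_date_time: 1, status: 1, calType: 1 })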

Sorting out with highest values first

for (index, item) in users.enumerated() {
    //countUserIds[item.name] = [Float(item.id), avgs[index]]
    print("Name:", item.name, "ID:", item.id, "Score:", avgs[index])
}
I'm trying to sort my output so the highest averages are listed first, but if I sort the averages array it will no longer be linked to the right name and ID. How can I sort the averages and display the users with the highest average first?

Sort Ascending and Descending in a List

I have a List which has attributes (StarRating, Price) like the below in Scala:
ListBuffer(RatingAndPriceDataset(3.5,200), RatingAndPriceDataset(4.5,500),RatingAndPriceDataset(3.5,100), RatingAndPriceDataset(3.0,100))
My Sort Priority Order is :
First sort on the basis of star rating (descending), then pick the three lowest prices. So in the above example, I get the list as:
RatingAndPriceDataset(4.5,500),RatingAndPriceDataset(3.5,100), RatingAndPriceDataset(3.5,200)
As can be seen, the star rating is the one which takes higher priority in the sort. I tried a few things but failed to do the sort. If I sort on the basis of star rating and then price, it fails to keep the priority order.
What I get from calling a method is a list like this (as above). The list will have some data like below (example):
StarRating Price
3.5 200
4.5 100
4.5 1000
5.0 900
3.0 1000
3.0 100
Expected result:
StarRating Price
5.0 900
4.5 100
4.5 1000
Using the data in the table you've provided (which is different than the data in your code):
val input = ListBuffer(RatingAndPriceDataset(3.5,200), RatingAndPriceDataset(4.5,100),RatingAndPriceDataset(4.5,1000), RatingAndPriceDataset(5.0, 900), RatingAndPriceDataset(3.0, 1000), RatingAndPriceDataset(3.0, 100))
val output = input.sortBy(x => (-x.StarRating, x.Price)).take(3) // By using `-` in front of `x.StarRating` we're getting descending order of sorting
println(output) // ListBuffer(RatingAndPriceDataset(5.0,900), RatingAndPriceDataset(4.5,100), RatingAndPriceDataset(4.5,1000))

Query Exact Matches in MongoDB

I am working on developing a web application feature that suggests prices for users based on previous orders in the database. I am using the MongoDB NoSQL database. Before I begin, I am trying to figure out the best way to set up the order object to return the correct results.
When a user places an order such as the following: 1 cheeseburger + 1 fry, McDonalds, 12345 E. Street, MyTown, USA... it should only return objects that are EXACT matches from the database.
For example, I would not want to receive an order that contained 1 cheeseburger + 1 fry + 1 shake. I will be keeping running averages of the prices and counts for that exact order.
{
    restaurantAddress: "12345 E. Street, MyTown, USA",
    restaurantName: "McDonald's",
    orders: {
        { cheeseburger: 1, fries: 2 }: {
            sumPaid: 1444.55,
            numTimesOrdered: 167,
            avgPaid: 8.65 // gets recomputed w/ each new order
        },
        { /* repeat for each unique item config */ },
        { /* another unique item (or items) */ }
    }
}
Do you think this is a valid and efficient way to set up the document in MongoDB? Or should I be using multiple documents?
If this is valid, how can I query it to only return exact orders? I looked into $eq but it did not seem to be exactly what I was looking for.
So I believe we have solved the problem. The solution is to create a string that is unique for the order on the server side. For example, we will write a function that would transform the 1 cheeseburger + 2 fries into burger1fries2. In order to keep consistency in the database, we will first sort the entries alphabetically, so we will always hit what we intended with the query. A similar order of 2 fries + 1 cheeseburger would generate the string burger1fries2 as well.
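A minimal sketch of that idea in the shell. The helper name orderKey, the orderKey field and the orders collection are hypothetical, and the sketch assumes one document per unique order per restaurant rather than the nested layout above:

// Hypothetical helper: build a canonical string from an order object such as {fries: 2, cheeseburger: 1}
function orderKey(order) {
    return Object.keys(order)
        .sort() // alphabetical order, so {fries: 2, cheeseburger: 1} and {cheeseburger: 1, fries: 2} give the same key
        .map(function(item) { return item + order[item]; })
        .join("");
}

var key = orderKey({fries: 2, cheeseburger: 1}); // "cheeseburger1fries2"

// Exact-match lookup for that order at a given restaurant
db.orders.findOne({
    restaurantAddress: "12345 E. Street, MyTown, USA",
    restaurantName: "McDonald's",
    orderKey: key
});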

MongoDb: how to create the right (composite) index for data with many searchable fields

UPDATE: I need to add that the point of this question is to allow me to define schemas for Json Rest Stores. The user can search by any one key, or several keys. So, I cannot easily predict what the users will search by -- it could be 1, 2, 5 fields (this is especially true for data-rich fields like people, bookings, etc.)
Imagine that I have an index as such:
{ "item": 1, "location": 1, "stock": 1 }
Following the MongoDb manual on indexes:
MongoDB can use this index to support queries that include:
the item field,
the item field and the location field,
the item field and the location field and the stock field, or
only the item and stock fields; however, this index would be less efficient than an index on only item and stock.
MongoDB cannot use this index to support queries that include:
only the location field,
only the stock field, or
only the location and stock fields.
Now, suppose I have a schema with exactly these fields:
item: String
location: String
stock: String
qty: number
And imagine I want to make sure every query is indeed indexed. I would do:
For item:
item, location, stock, qty
item, location, qty, stock
item, stock, qty, location
item, stock, location, qty
item, qty, location, stock
item, qty, stock, location
For location:
...you know the gist
Now... this seems a little insane. If you have a database with TEN searchable fields, this becomes clearly unworkable, as the number of required indexes (one per ordering of the fields) explodes.
Am I missing something? My idea was to define a schema, define which fields were searchable, and write a function that makes up all of the needed indexes regardless of what fields were present and what fields weren't. However, I am thinking about it, and... well, I must be missing something.
Am I?
I will try to explain what this means by example. B-tree-based indexes are not something MongoDB-specific; they are a rather common concept.
So when you create an index, you show the database an easier way to find something. The index is stored somewhere, with a pointer to the location of the original document. This information is ordered, and you can look at it as a binary tree with a really nice property: the search is reduced from O(n) (a linear scan) to O(log(n)), which is much, much faster, because each time we cut our search space in half (potentially reducing 10^6 documents to about 20 lookups). For example, say we have a big collection with documents {a : some int, b: 'some other things'}, and we index it by a; we end up with another data structure which is sorted by a. It looks this way (by this I do not mean that it is another collection, this is just for demonstration):
{a : 1, pointer: to the field with a = 1}, // if a is the smallest number in the starting collection
...
{a : 999, pointer: to the field with a = 999} // assuming that 999 is the biggest value
So now we are searching for a = 18. Instead of going one by one through all elements, we take the element in the middle; if it is bigger than 18, we split the lower half again and check the element there. We continue until we find a = 18. Then we look at the pointer and, knowing it, we extract the original document.
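As a shell-level sketch of the same idea (the collection name is just illustrative):

// Create a single-field index on a; MongoDB builds the ordered structure described above
db.things.createIndex({ a: 1 });

// This lookup can now seek through the index in O(log(n)) instead of scanning every document
db.things.find({ a: 18 });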
The situation with a compound index is similar (instead of ordering by one field we order by several). For example, you have a collection:
{ "item": 5, "location": 1, "stock": 3, 'a lot of other fields' } // was stored at position 5 on the disk
{ "item": 1, "location": 3, "stock": 1, 'a lot of other fields' } // position 1 on the disk
{ "item": 2, "location": 5, "stock": 7, 'a lot of other fields' } // position 3 on the disk
... huge amount of other data
{ "item": 1, "location": 1, "stock": 1, 'a lot of other fields' } // position 9 on the disk
{ "item": 1, "location": 1, "stock": 2, 'a lot of other fields' } // position 7 on the disk
and want an index { "item": 1, "location": 1, "stock": 1 }. The lookup table would look like this (one more time - this is not another collection, this is just for demonstration):
{ "item": 1, "location": 1, "stock": 1, pointer = 9 }
{ "item": 1, "location": 1, "stock": 2, pointer = 7 }
{ "item": 1, "location": 3, "stock": 1, pointer = 1 }
{ "item": 2, "location": 5, "stock": 7, pointer = 3 }
... huge amount of other data (but not necessarily here: if its item were 1, it would be next to the other item = 1 entries)
{ "item": 5, "location": 1, "stock": 3, pointer = 5 }
See that here everything is basically sorted by item, then by location and then by stock.
The same way as with a single-field index, we do not need to scan everything. If we have a query which looks for item = 2, location = 5 and stock = 7, we can quickly identify where the documents with item = 2 are, and then in the same way quickly identify, among those, the ones with location = 5, and so on.
And now the interesting part. Although we created just one index (it is a compound index, but still one index), we can use it to quickly find elements
by item alone. Really, all we need to do is the first step. So there is no point in creating another index {item : 1}, because it is already covered by the compound index.
We can also quickly find by item and location together (we need only 2 steps).
Cool: one index, and it helps us in three different ways. But wait a minute: what if we want to find by item and stock? It looks like we can speed this query up as well: we can find all elements with the specific item in O(log(n)), and... here we have to stop - the magic is over. We need to iterate through all of those keys to check the stock. But still pretty good.
But maybe it can help us with other queries. Let's look at a query by location only, which at first glance looks like it is already ordered. But if you look at it, you see that it is a mess: location = 1 shows up at the beginning and then again at the end. It cannot help you at all.
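Putting that into shell terms (again with illustrative names), the single compound index covers the prefix queries but not a location-only query:

db.things.createIndex({ item: 1, location: 1, stock: 1 });

db.things.find({ item: 2 });                               // supported: item is the leading prefix
db.things.find({ item: 2, location: 5 });                  // supported: two-key prefix
db.things.find({ item: 2, location: 5, stock: 7 });        // supported: full key
db.things.find({ item: 2, stock: 7 });                     // partially helped: seeks on item, then scans those keys for stock
db.things.find({ location: 5 });                           // not supported by this index; would need its own {location: 1} index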
I hope this clarifies a few things:
why indexes are good (they reduce the time from O(n) to potentially O(log(n)))
why a compound index can help with some queries even though we never created an index on that particular field, and cannot help with some others
which single-field indexes are already covered by a compound index
why indexes can hurt (each one is an additional data structure which has to be maintained)
And this should tell you another valid thing: an index is not a silver bullet. You cannot speed up all your queries, so it sounds silly to think that by creating indexes on all fields EVERYTHING would be super fast.
What are your real query patterns? It's very unlikely that you would need to create all of these possible index combinations. I also doubt that including qty in the index would be of much use. Do you need to search for things where qty == 4 independent of location and item type?
An index doesn't need to identify every single record; it just needs to be specific enough to make any final scan small. Given an item code or a stock value, are there really that many locations that you'd also need to index on them?
I suspect in this case an index on item, an index on location and an index on stock would be sufficient to answer most likely queries with sufficient speed. (But we'd need to know more about what these field names mean and what the count and distribution of values within them is.)
Use explain with your queries and you can see how well they are performing. Add indices as necessary, don't create every possible ordering.
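For example (the collection name is just a placeholder), an execution-stats explain shows whether a query used an index and how much it had to scan:

db.inventory.find({ item: "abc", stock: "low" }).explain("executionStats")
// Check queryPlanner.winningPlan (IXSCAN vs COLLSCAN) and executionStats.totalDocsExamined
// to see whether an existing index already serves the query before adding another one.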