I have the following key-value pairs as output from my map function:
["hello"] => 12
["hello"] => 1
["world"] => 23
["world"] => 4
["canada"] => 18
When I use _count as the reduce function, I get the result 5, shown below. The system counts every row.
{
"rows": [
{
"key": null,
"value": 5
}
]
}
I use the same map function with _count again, but this time I add group=true to the query. I get the result below. It seems like the reduce function runs for every grouped key and counts within each group.
["hello"] => 2
["world"] => 2
["canada"] => 1
I can't understand the mechanism here. Why does the system work like this with and without grouping? If the reduce function runs for every unique key, shouldn't the result without grouping look like the below?
["hello"] => 1
["hello"] => 1
["world"] => 1
["world"] => 1
["canada"] => 1
With reduce=true&group=false and a _count reduce function you're asking the system to count the total number of entries in the index. Hence, you see the expected result of 5 in your case.
The group=true is a request to apply the reduce function at a per-key level only, and not do the final summation across all entries. As you can see, if you sum the values you get from the group=true case, you end up with the value you get for the group=false case: 2+2+1 = 5.
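To make the arithmetic concrete, here is a plain-Scala sketch of the two modes (just an analogy of what the view engine does, not actual CouchDB code):
val emitted = Seq("hello" -> 12, "hello" -> 1, "world" -> 23, "world" -> 4, "canada" -> 18)

// reduce=true&group=false: one _count over the whole index.
val total = emitted.size // 5

// group=true: _count applied once per distinct key.
val perKey = emitted.groupBy(_._1).map { case (k, vs) => k -> vs.size }
// Map(hello -> 2, world -> 2, canada -> 1)

// The per-key counts always sum back to the ungrouped total: 2 + 2 + 1 = 5.
assert(perKey.values.sum == total)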
It gets even more complicated if you emit a vector-valued key, for example where your map says something along the lines of
emit([doc.field1, doc.field2, doc.field3], 1)
Then you can group at a chosen level of precision, selecting how many leading components of the key to group on, using group_level=X. This is often used when dealing with time-series data, to be able to group per year, per month, or per day. This is explained in depth in the following blog post:
https://console.bluemix.net/docs/services/Cloudant/blog/mapreduce.html
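As a rough plain-Scala illustration of group_level (the [year, month, day] keys below are made up for the example):
// Hypothetical [year, month, day] keys, as a time-series map function might emit them.
val keys = Seq(List(2016, 1, 8), List(2016, 1, 9), List(2016, 2, 1), List(2017, 1, 8))

// group_level=N groups on the first N components of each key.
def countAtLevel(n: Int) = keys.groupBy(_.take(n)).map { case (k, vs) => k -> vs.size }

countAtLevel(1) // per year:  Map(List(2016) -> 3, List(2017) -> 1)
countAtLevel(2) // per month: Map(List(2016, 1) -> 2, List(2016, 2) -> 1, List(2017, 1) -> 1)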
I have a dataset
1, india, delhi
2, chaina, bejing
3, russia, mosco
2, england, London
When I perform
df.map(rec => (rec.split(",")(0).toInt, rec))
  .reduceByKey((x, y) => y)
  .map(rec => rec._2)
  .foreach(println)
The above code returns the output below. Usually reduceByKey works with an accumulated value and the current value to sum the values for the same key, but how is it working internally here? What is the value of x, what is the value of y, and how does it end up returning y?
1, india, delhi
2, chaina, bejing
3, russia, mosco
Re:"What value x and what value y", you can print to see their values. Make sure you check the executor logs and not driver to see this print statement. Moreover run it multiple times to see if they yield same values for x and y everytime. I do not think the order to read the records is guaranteed. It may not be evident with 4 records you are testing with above.
df.map(rec => (rec.split(",")(0).toInt, rec))
  .reduceByKey((x, y) => { println(s"x:$x,y:$y"); y })
  .map(rec => rec._2)
  .foreach(println)
Re:"how it is working internally"
reduceByKey merges values for a Key based on the given function. This function is first run locally on each partition. The output for each partition is then shuffled based on the keys and then another reduce operation happens. This is similar to combiner function in Map-reduce. This helps in less amount of data needed to shuffle.
Generally this is used in place of groupByKey(), which results in shuffling at the beginning and then you get a chance to work on the values for the keys.
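A minimal sketch of the difference (assuming an existing SparkContext named sc):
val pairs = sc.parallelize(Seq((1, "a"), (1, "b"), (2, "c")))

// reduceByKey: the function also runs map-side, per partition, so at most
// one pre-combined value per key per partition is shuffled.
val reduced = pairs.reduceByKey((x, y) => y)

// groupByKey: every value is shuffled across the network first,
// and only then do you collapse the values yourself.
val grouped = pairs.groupByKey().mapValues(_.last)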
(Two diagrams were attached here: one illustrating reduceByKey, with its map-side combine before the shuffle, and one illustrating groupByKey, which shuffles all values first.)
I am trying to implement some kind of paging, but I need to do it on a grouped result: every time I fetch a page, it is a requirement that all data for a given group is fetched.
Below code:
var erere = dbCtx.StatusViewList
.GroupBy(p => p.TurbineNumber)
.OrderBy(p => p.FirstOrDefault().TurbineNumber)
.Skip(0)
.Take(10)
.ToList();
I have 200k items, and the statement above is so slow that the connection times out. My best bet is that it's the OrderBy that slows it down. Any suggestions on how to do this, or on how to speed up the statement above?
In your case, grouping on the server side is not needed at all, because you will fetch all the data anyway, just with additional overhead on the server. So try another approach:
var groupPage = dbCtx.StatusViewList.Select(x => x.TurbineNumber)
    .Distinct().OrderBy(x => x).Skip(40).Take(20).ToList();
var data = dbCtx.StatusViewList.Where(x => groupPage.Contains(x.TurbineNumber))
.ToList().GroupBy(x => x.TurbineNumber).ToList();
The GroupBy needs to visit all elements to group all StatusViews into groups of StatusViews that have equal TurbineNumber.
After that, you take every group, take the first element from every group, and ask for its TurbineNumber in order to sort by TurbineNumber.
Apparently you take into account that a group of StatusViews might be empty (FirstOrDefault instead of First), but then again, you assume that FirstOrDefault never returns null.
One of the things that could speed up your query is using the Key of your groups. The Key is the value you grouped on, in your case the TurbineNumber: all elements in a group have the same TurbineNumber.
var result = dbCtx.StatusViewList
.GroupBy(statusView => statusView.TurbineNumber)
.OrderBy(group => group.Key)
...
I think that will be a first step to improve performance.
However, you return a fixed number of groups. Some groups might be huge, with thousands of elements; some groups might be small, with only one element. So the result of one page could be 10 groups of 1000 elements each, 10000 elements in total, or 10 groups of one element each, only 10 elements in total. I'm not sure this is the paging behaviour you want.
Wouldn't you prefer a page that always has the same number of elements, preferably sharing the same TurbineNumber? If there are not many StatusViews with the same TurbineNumber, fill the rest of the page with the next TurbineNumber; if there are too many StatusViews with one TurbineNumber, divide them over several pages.
Something like:
TurbineNumber StatusView
4 A
4 B
4 F
5 D
5 K
6 C
6 Z
6 Q
6 W
7 E
To do this, don't GroupBy; use OrderBy and then Skip and Take:
IEnumerable<StatusView> GetPage(int pageNr, int pageSize)
{
    return dbCtx.StatusViewList
        .OrderBy(statusView => statusView.TurbineNumber)
        .Skip(pageNr * pageSize)
        .Take(pageSize);
}
If you create an extra index for TurbineNumber, this will be very fast:
In your DbContext.OnModelCreating(DbModelBuilder modelBuilder):
// Add an extra (non-unique) index on TurbineNumber
// (IndexAnnotation lives in System.Data.Entity.Infrastructure.Annotations):
var indexAttribute = new IndexAttribute("TurbineIndex", 0) { IsUnique = false };
var indexAnnotation = new IndexAnnotation(indexAttribute);
modelBuilder.Entity<StatusView>()
    .Property(statusView => statusView.TurbineNumber)
    .HasColumnAnnotation(IndexAnnotation.AnnotationName, indexAnnotation);
I have an RDD of the following structure: RDD[(String, Map[String, List[Product with Serializable]])].
This is some sample data:
(600,Map(base_data -> List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)), additional_data -> List((1,2))))
(601,Map(base_data -> List((10:01 01-08-2016,600,111,1,2), (10:02 01-08-2016,619,111,1,2), (10:01 01-08-2016,600,111,1,4)), additional_data -> List((5,6))))
I want to calculate the number of unique values in the 4th field (zero-based) of the sub-lists.
For instance, take the first entry. Its list is List((10:00 01-08-2016,600,111,1,1), (10:15 01-08-2016,615,111,1,5)), which contains 2 unique values (1 and 5) in the 4th field of its sub-lists.
The second entry also contains 2 unique values (2 and 4), because 2 is repeated twice.
The resulting RDD should be of the format RDD[Map[String,Any]].
I tried to solve this task as follows:
val result = myRDD.map({
  line => Map(("id", line._1),
    ("unique_count", line._2.get("base_data").groupBy(l => l).count(_)))
})
However, this code does not do what I need. In fact, I don't know how to properly indicate that I want to group by the 4th field...
You are quite close to the solution. There is no need to call groupBy; you can access the items of each tuple by index, transform the resulting List into a Set, and then just return the size of the Set, which corresponds to the number of unique elements:
("unique_count", line._2("base_data").map(bd => bd.productElement(4)).toSet.size)
I use a Mongo query to calculate the sum of prices for every item.
My query looks like this:
$queryBuilder = new Query\Builder($this, $documentName);
$queryBuilder->field('created')->gte($startDate);
$queryBuilder->field('is_test_value')->notEqual(true);
..........
$queryBuilder->map('function() {emit(this.item, this.price)}');
$queryBuilder->reduce('function(item, valuesPrices) {
return {sum: Array.sum(valuesPrices)}
}');
And this works, no problem. But I found that in some cases (approximately 20 out of 200 results) I get a strange result in the sum field: instead of a sum value I see a construction like
[object Object]444444444444444
where 4 is the price of the item.
I tried to replace the reduce block with a block like this:
var sum = 0;
for (var i = 0; i < valuesPrices.length; i++) {
    sum += parseFloat(valuesPrices[i]);
}
return {sum: sum}
In that case I get NaN.
I suspected that some data in the price field was inserted incorrectly (not as a float, but as a string, an object, etc.). I tried executing my query from the mongo CLI, and I see that all price values are integers.
It's not "strange" at all. You "broke the rules" and now you are paying for it.
"MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key."
The primary rule of mapReduce (as cited above) is that you must return exactly the same structure from the "reducer" as you emit from the "mapper". This is because the "reducer" can actually run several times for the same "key"; that is how mapReduce processes large lists.
You fix this by just returning a singular value, just like you did in the emit:
return Array.sum(values);
And then there will not be a problem. Adding an object key to that makes the data inconsistent, and thus you get an error when the "reduced" result gets fed back into the "reducer" again.
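The failure mode is easy to reproduce outside MongoDB too. Here is a small Scala simulation of a reducer being re-invoked on its own output (the function name and prices are made up for the example):
// Consistent shape: numbers in, a number out, so re-reducing stays correct.
def reduceSum(values: Seq[Double]): Double = values.sum

// First pass: partial reduces over two chunks of the same key's prices.
val partials = Seq(reduceSum(Seq(4.0, 4.0, 4.0)), reduceSum(Seq(4.0, 4.0)))

// Second pass: the partial outputs are fed back in as inputs.
val finalSum = reduceSum(partials) // 20.0, same as summing all five prices at once

// The broken version returned an object like {sum: n}; when that object came back
// as an input, JavaScript string concatenation produced "[object Object]44...".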
The output from MongoDB's map/reduce includes something like 'counts': {'input': I, 'emit': E, 'output': O}. I thought I clearly understood what those mean, until I hit a weird case that I can't explain.
According to my understanding, counts.input is the number of rows that match the condition (as specified in query). If so, how is it possible that the following two queries give different results?
db.mycollection.find({MY_CONDITION}).count()
db.mycollection.mapReduce(SOME_MAP, SOME_REDUCE, {'query': {MY_CONDITION}}).counts.input
I thought the two should always give the same result, independent of the map and reduce functions, as long as the same condition is used.
The map/reduce pattern is like a group function in SQL: it groups several input rows into one output row, so you can't expect the same number of results.
The count in the mapReduce() result is the number of rows after the map/reduce functions have run.
For example, say you have 2 rows:
{'id':3,'num':5}
{'id':4,'num':5}
And you apply this map function:
function() {
    emit(this.num, 1);
}
After this map function you get 2 rows:
{5, 1}
{5, 1}
And now you apply your reduce method:
function(k, vals) {
    var sum = 0;
    for (var i in vals) sum += vals[i];
    return sum;
}
You now have only 1 row returned:
2
Also: is your server steady-state in between the two calls? If documents are inserted or removed between the two queries, the counts can legitimately differ.