How to use MapReduce for k-Means Spatial Clustering - mongodb

I'm new to mongodb and map-reduce and want to evaluate spatial data by using a k-means spatial clustering. I found this article which seems to be a good description of the algorithm, but I have no clue how to translate this into a mongo shell script. Assume my data looks like:
{
_id: ObjectId(),
loc: {x: <longitude>, y: <latitude>},
user: <userid>
}
And I can use k = sqrt(n/2), where n is the number of samples.
I can use aggregates to get the bounding extents of the data and the count, etc.
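For example, something like this should give the bounding extents and the count in one pass (a sketch, assuming my collection is called main):
db.main.aggregate([
    { $group: {
        _id: null,
        minX: { $min: "$loc.x" }, maxX: { $max: "$loc.x" },
        minY: { $min: "$loc.y" }, maxY: { $max: "$loc.y" },
        n: { $sum: 1 }
    } }
]);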
I kind of got lost at the reference to a file of the cluster points, which I assume would be just another collection. I have no idea how to do the iteration, or whether that would be done in the client or the database.
Ok, I have made a little progress on this in that I have generated an array of initial random points that I need to compute the sum of squared distances against during the map-reduce phase, but I do not know how to pass these to the map function. I took a stab at writing the map function:
var mapCluster = function() {
    // pts (the current cluster centres) is injected via the mapReduce "scope" option
    var key = -1;
    var sos = 0;
    var pos = [this.loc.x, this.loc.y];
    for (var i = 0; i < pts.length; i++) {
        var dx = pts[i][0] - pos[0];
        var dy = pts[i][1] - pos[1];
        var sumOfSquares = dx * dx + dy * dy;
        if (i == 0 || sumOfSquares < sos) {
            key = i;
            sos = sumOfSquares;
        }
    }
    emit(key, pos);
};
In this case the cluster points are an array like the following, which probably will not work:
var pts = [ [x,y], [x1,y1], ... ];
So for each map-reduce iteration, we compare all the collection points against this array and emit the index of the closest cluster point along with the location of the collection point. Then, in the reduce function, the average of the points associated with each index would be used to create the new cluster point location. Then in the finalize function I can update the cluster document.
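Something like this might work for the reduce and finalize steps (an untested sketch, names are mine). Note the map would then need to emit a summable value such as { x: this.loc.x, y: this.loc.y, count: 1 } instead of the raw pair, because MongoDB may call reduce repeatedly on partial value sets:
var reduceCluster = function(key, values) {
    // sum the member points; dividing here would be wrong because
    // reduce can run more than once per key on partial results
    var res = { x: 0, y: 0, count: 0 };
    values.forEach(function(v) {
        res.x += v.x;
        res.y += v.y;
        res.count += v.count;
    });
    return res;
};
var finalizeCluster = function(key, value) {
    // the new cluster centre is the mean of the member points
    return [value.x / value.count, value.y / value.count];
};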
I assume I could do a findOne() on the cluster document to load the cluster points in the map function, but do we want to load this document on every call to map? Or is there a way to load it once per iteration?
So it looks like you can do the above using the scope variable like this:
db.main.mapReduce( mapCluster, reduceCluster, { scope: { pts: pts, ... } } );
You have to be careful about variable names in the scope: as these are placed in the scope of the map, reduce and finalize functions, they can collide with existing variable names.

What have you tried?
Note that you will need more than one round of mappers.
With the canonical approach of running k-means on MR, you need one mapper/reducer per iteration.
So, can you try to write the map and reduce steps of a single iteration only?
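For example, the per-iteration driver could live in the client, along these lines (a sketch under the assumptions above: map, reduce and finalize functions like those sketched in the question, centres written to a clusters collection, and a hypothetical initialRandomCentres helper):
var k = Math.ceil(Math.sqrt(db.main.count() / 2));
var pts = initialRandomCentres(k);   // hypothetical helper returning k [x, y] pairs
var maxIters = 20;                   // or stop once the centres no longer move
for (var iter = 0; iter < maxIters; iter++) {
    db.main.mapReduce(mapCluster, reduceCluster, {
        out: { replace: "clusters" },
        finalize: finalizeCluster,
        scope: { pts: pts }
    });
    // read the new centres back for the next round
    pts = db.clusters.find().sort({ _id: 1 }).toArray().map(function(doc) {
        return doc.value;
    });
}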

Related

How can you access an element in a list with over 10k values for a dataset in code.org

So I am trying to recreate a Wordle game inside of code.org using the Wordle dataset.
//Getting Wordle Answer
var answers = getColumn("Wordle", "validWordleAnswer");
var letters = ["letter1", "letter2", "letter3", "letter4", "letter5"];
// randomNumber(min, max) on code.org is inclusive at both ends, so cap at length - 1
var index = randomNumber(0, answers.length - 1);
console.log(index);
So of course the console.log outputs my index number, but out of the 10k words stored in the dataset, how can I access the specific one that the randomNumber generated? For example, if the index was 500, what can I write in order to console.log the correct answer? Thanks!
The length of the array doesn't matter, 1 element or 20 billion. You still access it with bracket notation:
console.log(answers[index]);
Here's a small example (remember, indexing starts at 0):
const answers = ["chewy", "rocks", "piano", "forte", "phone"];
const index = Math.floor(Math.random() * answers.length);
console.log(index);
console.log(answers[index]);

Looking for advice on improving a custom function in AnyLogic

I'm estimating last mile delivery costs in a large urban network using by-route distances. I have over 8000 customer agents and over 100 retail store agents plotted in a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all customers common to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, "customer.pStore" of Store type). This function also adds, in turn, each customer to the store agent's collection of customers ("store.colCusts"; it's an array list with Customer type elements).
Next, I have a function that iterates through the store agent population and calculates the two average distance measures above (d0_bar & d1_bar) and writes the results to a txt file (see code below). The code works, fortunately. However, the problem is that with such a massive dataset, the process of iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or, is there a better way in AnyLogic of getting these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores) {
    distancesStore.print(store.storeCode + "," + store.colCusts.size() + ","
            + store.colCusts.size() * (store.colCusts.size() - 1) / 2 + ",");

    //calculates average distance from store j to customer nodes that belong to store j
    double sumFirstDistByStore = 0.0;
    int h = 0;
    while (h < store.colCusts.size()) {
        sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
        h++;
    }
    distancesStore.print((sumFirstDistByStore / store.colCusts.size()) / 1609.34 + ",");

    //calculates average of distances between all customer nodes belonging to store j
    double custDistSumPerStore = 0.0;
    int loopLimit = store.colCusts.size();
    int i = 0;
    while (i < loopLimit - 1) {
        int j = 1;
        while (j < loopLimit) {
            custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
            j++;
        }
        i++;
    }
    distancesStore.print((custDistSumPerStore / (loopLimit * (loopLimit - 1) / 2)) / 1609.34);
    distancesStore.println();
}
Firstly a few simple comments:
Have you tried timing a single distanceByRoute call? E.g. can you try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
The first simple change is to use Java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
...
});
This will process the stores entries in parallel using the standard Java Streams API.
It also looks like the second loop, where the average distance between customers is calculated, doesn't take account of mirroring; that is to say, distance a->b is equal to b->a. Hence, for example, 4 customers require only 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. Whereas in the case of 4 customers your second while loop will perform 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention. Starting the inner loop at j = i + 1 instead of j = 1 would visit each pair exactly once and roughly halve the number of routing calls.
Generally, for long-running operations it is a good idea to include some traceln calls to show progress with associated timing.
Please have a look at the above and post your results. With more information, additional performance improvements may be possible.

MongoDB Query advice for weighted randomized aggregation

So far I have encountered ways of selecting random documents, but my problem is a bit more of a pickle. So here goes:
I have a collection which contains, say, 1000+ documents (products).
Say each document has a more or less generic format; for simplicity it is:
{"_id":{},"name":"Product1","groupid":5}
The groupid is a number say between 1 to 20 denoting the product belongs to that group.
Now if my query input is something like an array of {groupid -> weight} pairs, e.g. [{"2": 4}, {"7": 6}], and say another parameter n (= 10, say), then I need to be able to pick 4 random documents that belong to groupid 2 and 6 random documents that belong to groupid 7.
The only solution I can think of is to run m subqueries, where m is the array length in the query input.
How do I accomplish this in an efficient manner in MongoDB, probably using a MapReduce?
Pick n random documents for each group:
Group the records by the groupid field, emitting the groupid as key and the record as value.
For each group, pick n random documents from the values array.
Let,
var parameter = {"5":1,"6":2}; //groupid->weight, keep it as an Object.
be the input to the map reduce functions.
The map function emits only those group ids which we have provided in the parameter:
var map = function() {
    // "parameter" is available here via the mapReduce scope option
    if (parameter.hasOwnProperty(this.groupid)) {
        emit(this.groupid, this);
    }
};
The reduce function: for each group, pick random records based on the parameter object in scope:
var reduce = function(key, values) {
    var length = values.length;
    var docs = [];
    var added = [];
    // pick parameter[key] distinct random indices, capped at the number
    // of available values so the loop cannot run forever
    var wanted = Math.min(parameter[key], length);
    while (docs.length < wanted) {
        var index = Math.floor(Math.random() * length);
        if (added.indexOf(index) == -1) {
            docs.push(values[index]);
            added.push(index);
        }
    }
    return { result: docs };
};
Invoke mapReduce on the collection, passing the parameter object in scope:
db.collection.mapReduce(map,
reduce,
{out: "sam",
scope:{"parameter":{"5":1,"6":2,"n":10}}})
To get the dumped output:
db.sam.find({},{"_id":0,"value.result":1}).pretty()
When you bring the parameter n into the picture, you need to specify the number of documents for each group as a ratio; otherwise that parameter is not necessary at all.
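As an aside, on MongoDB 3.2+ you could skip map-reduce entirely and run one small $sample aggregation per group. This is essentially the m subqueries the question mentions, but each one is cheap (a sketch, using the example weights from the question):
var parameter = {"2": 4, "7": 6};   // groupid -> number of documents wanted
var results = [];
Object.keys(parameter).forEach(function(groupid) {
    results = results.concat(db.collection.aggregate([
        { $match: { groupid: Number(groupid) } },
        { $sample: { size: parameter[groupid] } }
    ]).toArray());
});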

how to get a parentNode's index i using d3.js

Using d3.js, were I after (say) some value x of a parent node, I'd use:
d3.select(this.parentNode).datum().x
What I'd like, though, is the data's (i.e. the datum's) index. Suggestions?
Thanks!
The index of an element is only well-defined within a collection. When you're selecting just a single element, there's no collection and the notion of an index is not really defined. You could, for example, create a number of g elements and then apply different operations to different (overlapping) subsets. Any individual g element would have several indices, depending on the subset you consider.
In order to do what you're trying to achieve, you would have to keep a reference to the specific selection that you want to use. Having this and something that identifies the element, you can then do something like this.
var value = d3.select(this.parentNode).datum().x;
var index = -1;
selection.each(function(d, i) { if(d.x == value) index = i; });
This relies on having an attribute that uniquely identifies the element.
If you have only one selection, you could simply save the index as another data attribute and access it later.
var gs = d3.selectAll("g").data(data).append("g")
.each(function(d, i) { d.index = i; });
var something = gs.append(...);
something.each(function() {
d3.select(this.parentNode).datum().index;
});
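If you are on d3 v4 or later (the snippets above are v3-era), another option is to compare DOM nodes directly, since selection.nodes() returns the selected elements as an array. A sketch, reusing the something selection from above:
var gs = d3.selectAll("g");
something.each(function() {
    // the parent's index is its position within the gs selection
    var parentIndex = gs.nodes().indexOf(this.parentNode);
});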

Random Sampling from Mongo

I have a mongo collection with documents. There is one field in every document which is 0 or 1. I need to randomly sample 1000 records from the database and count the number of documents that have that field as 1. I need to do this sampling 1000 times. How do I do it?
For people coming to this answer: you should now use the new $sample aggregation stage, introduced in MongoDB 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate(
[ { $sample: { size: 15 } } ]
)
Then add another step to count up the 0s and 1s using $group to get the count.
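A sketch of that counting stage (assuming the 0/1 field is called thefield, the name used later in this thread):
db.collection_of_things.aggregate([
    { $sample: { size: 1000 } },
    { $group: { _id: "$thefield", count: { $sum: 1 } } }
])
To repeat the sampling 1000 times as the question asks, run this in a loop from the client.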
For MongoDB 3.0 and before, I use an old trick from SQL days (which I think Wikipedia uses for their random page feature). I store a random number between 0 and 1 in every object I need to randomize; let's call that field "r". You then add an index on "r":
db.coll.ensureIndex({r: 1});
Now to get random x objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs, this may be overkill, but if you are going to be doing lots of sampling over time, this is a very efficient way without putting load on your backend.
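One caveat worth adding: if startVal happens to land above the largest stored r, the query returns fewer than x documents. A sketch of a wrap-around fallback:
var startVal = Math.random();
var docs = db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x).toArray();
if (docs.length < x) {
    // wrap around to the low end of the r range for the remainder
    docs = docs.concat(
        db.coll.find({r: {$lte: startVal}}).sort({r: 1}).limit(x - docs.length).toArray()
    );
}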
Here's an example in the mongo shell, assuming a collection collname and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (var i = 0; i < numSamples; i++) {
    // pick a random offset into the collection
    // (note: skip() walks the collection, so this is slow on large collections)
    var random = Math.floor(Math.random() * total);
    var doc = db.collname.find().skip(random).limit(1).next();
    if (doc.thefield == 1) {
        count++;
    }
}
I was going to edit my comment on @Stennie's answer with this, but you could also use a separate auto-incrementing ID index here as an alternative if you were to skip over HUGE amounts of records (talking huge here).
I wrote another answer to a question a lot like this one, where someone was trying to find the nth record of the collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get random rows, of course.
If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
counter = db.SequenceField(collection_name="user.counters")
Then, to fetch a random list of, say, 100, do the following:
import random

def get_random_users(number_requested):
    total = User.objects.count()
    users_to_fetch = random.sample(range(1, total + 1), min(number_requested, total))
    return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)