MongoDB: sorting by a linear combination of values

I'm using mongodb (any version, I'd happily upgrade to make this work as I'm asking). I have a large collection (~50 million documents) with several numeric fields, e.g.:
{_id: 'doc1', a: 5, b: 2, c: 0},
{_id: 'doc2', a: 4, b: 9, c: 6},
{_id: 'doc3', a: 1, b: 7, c: 4},
{_id: 'doc4', a: 8, b: 1, c: 1},
...
I'd like to sort this collection by various linear combinations of a, b, and c (in reality there are something like 20 fields I'd like to combine). So for example I'd like to sort by 3*a + 4*b + 10*c. The weights (3, 4 and 10 in my example) are something I'd like to experiment with rapidly.
I didn't see a simple way for indexes to support efficient sorting on linear combinations of fields. I know I can do this with the aggregation pipeline, but I think it will still require a collection scan for every new set of weights (?).
If I were implementing MongoDB, I can imagine computing an index on the expression 3*a + 4*b + 10*c efficiently from the existing indexes on a, b and c, i.e. without a full collection scan for each new linearly weighted index. I'm not sure whether that is true theoretically, or whether it translates into anything I can do in practice to solve this problem.
Any input is welcome!
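For reference, the aggregation approach mentioned above could look roughly like the sketch below. The helper name weighted_sort_pipeline and the output field name score are made up for illustration. Because the sort key is computed per document, a pipeline like this cannot use an index for the sort, so it still reads every document for each new set of weights.

```python
def weighted_sort_pipeline(weights, descending=True):
    """Build an aggregation pipeline that sorts by sum(weight * field).

    `weights` maps field names to numeric weights, e.g. {"a": 3, "b": 4}.
    The helper name and the "score" field are illustrative assumptions.
    """
    # One {"$multiply": ["$field", weight]} term per weighted field.
    terms = [{"$multiply": [f"${field}", weight]} for field, weight in weights.items()]
    return [
        {"$addFields": {"score": {"$add": terms}}},
        {"$sort": {"score": -1 if descending else 1}},
    ]


pipeline = weighted_sort_pipeline({"a": 3, "b": 4, "c": 10})
# With PyMongo this could be run as:
#   collection.aggregate(pipeline, allowDiskUse=True)
```

Changing the weights only changes the pipeline document, so experimenting is cheap to write, even though each run pays for a full scan and sort.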

Related

How does sortWith in Scala work in terms of iterating a tuple?

A list can be iterated as follows:
scala> val thrill = "Will" :: "fill" :: "until" :: Nil
val thrill: List[String] = List(Will, fill, until)
scala> thrill.map(s => s + "y")
val res14: List[String] = List(Willy, filly, untily)
The code above first creates a list. The second command then maps over it: each element is bound to 's' in turn, and the function builds a new string from 's' by appending the character 'y'.
However, I do not understand the following iteration process:
scala> thrill.sortWith((s,t) => s.charAt(0).toLower < t.charAt(0).toLower)
val res19: List[String] = List(fill, until, Will)
Does the tuple (s,t) take two elements of thrill at once and compare them? How exactly is sorting performed with this syntax/function?
Sorting arranges data in ascending or descending order. Sorted data is easier to search.
Scala uses TimSort, which is a hybrid of merge sort and insertion sort.
Here is the signature of the sortWith function in Scala:
def sortWith(lt: (A, A) => Boolean): Repr
The sortWith function sorts this sequence according to a comparison function. It takes a comparator function and sorts according to it, so you can provide your own custom comparison.
It performs TimSort on the input list to produce the sorted output.
Yes, the function compares two elements at a time to determine their ordering.
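To make the "two elements at a time" point concrete, here is a rough Python analogue of sortWith (not Scala's implementation). The helper name sort_with is made up for illustration. It wraps a less-than function in a three-way comparator and hands it to Python's built-in sorted, which is also a TimSort.

```python
from functools import cmp_to_key


def sort_with(items, lt):
    """Sort `items` using a less-than function, in the spirit of sortWith."""
    def compare(x, y):
        # The sort calls this with two elements at a time.
        if lt(x, y):
            return -1
        if lt(y, x):
            return 1
        return 0  # neither is "less": treat as equal, original order is kept

    return sorted(items, key=cmp_to_key(compare))


thrill = ["Will", "fill", "until"]
result = sort_with(thrill, lambda s, t: s[0].lower() < t[0].lower())
print(result)  # ['fill', 'until', 'Will']
```

The lambda never sees the whole list. It only answers "does s come before t?" for whichever pair the algorithm asks about.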
The implementation of sortWith falls back on Java's java.util.Arrays.sort, which has the following signature:
sort(T[] a, Comparator<? super T> c)
It takes an array of T and a comparator: a binary function that can compare any two values of type T and tell you how they relate to each other (whether the first should come before, at the same position as, or after the second, according to your function).
Behind the scenes it is essentially an iterative merge sort; the details can be found on the Wikipedia page for merge sort.
Optimizations aside, merge sort is a divide-and-conquer algorithm. It breaks the collection down into groups of 2 elements, then merges the smaller lists in sorted order by comparing 2 elements at a time, which is where the binary comparison function you pass in comes into play.
i.e.
List(4, 11, 2, 1, 9, 0) //original list
Break down to chunks:
[4, 11], [2, 1], [9, 0]
Sort internally:
[4, 11], [1, 2], [0, 9]
Merge:
[1, 4, 11], [2], [0, 9]
[1, 2, 4, 11], [0, 9]
[0, 1, 2, 4, 11], [9]
[0, 1, 2, 4, 9, 11]
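The walkthrough above can be sketched as a small bottom-up merge sort. This is a simplified illustration in Python, not the JDK's implementation, and the names merge and merge_sort are made up here. It starts from runs of one element and merges neighbouring runs pairwise, whereas the walkthrough merges the chunks one after another; the result is the same.

```python
def merge(left, right, lt):
    """Merge two sorted lists, comparing one pair of elements at a time."""
    merged = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Take from the right run only when it is strictly smaller,
        # so equal elements keep their original order (stable sort).
        if lt(right[j], left[i]):
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def merge_sort(items, lt):
    """Bottom-up merge sort driven by a user-supplied less-than function."""
    runs = [[x] for x in items]  # every single element is a sorted run
    while len(runs) > 1:
        runs = [
            merge(runs[k], runs[k + 1], lt) if k + 1 < len(runs) else runs[k]
            for k in range(0, len(runs), 2)
        ]
    return runs[0] if runs else []


print(merge_sort([4, 11, 2, 1, 9, 0], lambda x, y: x < y))  # [0, 1, 2, 4, 9, 11]
```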
P.S. A nitpicky detail: sortWith takes a Function2 (a function of arity 2). The function you pass in is used to generate what we call an Ordering in Scala, which is then used as a Comparator (Ordering extends java.util.Comparator). Above, I have linked to the implementation of sorted, which is what sortWith calls once it generates this ordering and which is where most of the sorting logic happens.

Mongodb Hashed Sharding

If I choose {a:1,b:1,c:1} as my shard key and my query filters on {a:1} under a hashed sharding strategy, is the query a targeted operation, or is it broadcast to every shard in the cluster?
If it is a targeted operation, how does MongoDB determine the target, given that the hash of {a:1} is completely different from the hash of {a:1,b:1,c:1}?
The simple answer is: the query has to go to more than one shard.
Look at it this way:
Let's assume you have got the following collection:
//1
{
a: 1,
b: 1,
c: 1,
d: 1
},
//2
{
a: 1,
b: 1,
c: 1,
d: 2
},
//3
{
a: 1,
b: 1,
c: 2,
d: 5
}
According to your shard key, docs 1 and 2 must be in the same chunk (let's say, on shard number 1), while doc 3 could be stored in a different chunk (let's say, on shard number 2).
Now, if you search for {a: 1}, all three docs should be returned, meaning that mongo has to distribute the query to both shard no. 1 and shard no. 2.
As for your second question: in MongoDB versions before 4.4 you cannot create a compound hashed index at all (4.4 and later allow a compound index with a single hashed field). And even if you could hash the whole key, then yes, the hashed value would probably be different.
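The reasoning above can be illustrated with a toy model. This is not MongoDB's routing code: the shard count, the MD5-based hash and the function names shard_for and shards_to_query are all assumptions made for the sketch, and the whole key is hashed only to mirror the premise of the question.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key_values):
    """Toy placement rule: hash the full shard key, then pick a shard."""
    digest = hashlib.md5(repr(key_values).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def shards_to_query(query, key_fields=("a", "b", "c")):
    """Return the set of shards a router would have to contact."""
    if all(field in query for field in key_fields):
        # Full key known: the hash can be computed, so one shard suffices.
        return {shard_for(tuple(query[field] for field in key_fields))}
    # Part of the key is missing: the hash cannot be computed, so broadcast.
    return set(range(NUM_SHARDS))


print(shards_to_query({"a": 1, "b": 1, "c": 1}))  # a single shard
print(shards_to_query({"a": 1}))                  # every shard
```

In this model, docs 1 and 2 from the example share the key (1, 1, 1) and always land together, while a filter on a alone gives the router nothing to hash.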

Create Spark dataset with parts of other dataset

I'm trying to create a new dataset by taking intervals from another dataset, for example, consider dataset1 as input and dataset2 as output:
dataset1 = [1, 2, 3, 4, 5, 6]
dataset2 = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
I managed to do that using arrays, but MLlib needs a dataset.
My code with array:
def generateSeries(values: Array[Double], n: Int): Seq[Array[Double]] = {
  // Assumes n is the window size: sliding(n) yields every run of n
  // consecutive values, e.g. n = 2 gives [1, 2], [2, 3], [3, 4], ...
  values.sliding(n).toSeq
}
flatMap seems like the way to go, but how can a function look up the next value in the dataset?
The problem here is that an array is in no way similar to a DataSet. A DataSet is unordered and has no indices, so thinking in terms of arrays won't help you. Go for a Seq and treat it without using indices and positions at all.
So, to get array-like behaviour on a DataSet, you need to create your own indices. This is done simply by pairing each value with its position in the "abstract array" you are representing.
The type of your DataSet will then be something like [(Int, Int)], where the first element is the index and the second is the value. The pairs will arrive unordered, so you will need to rework your logic in a more functional way. It's not really clear what you're trying to achieve, but I hope this gives you a hint. Otherwise, explain the expected result in more detail in a comment on my answer and I will edit it.
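One way to read the index-pairing idea, sketched in plain Python rather than Spark: each (index, value) pair is emitted once for every window it belongs to (the flatMap step), and pairs are then grouped by window start (the group-by-key step). The function name sliding_windows is made up here, and the sketch assumes the indices run from 0 to size - 1 without gaps.

```python
from collections import defaultdict


def sliding_windows(indexed, n):
    """Build windows of n consecutive values from unordered (index, value) pairs."""
    size = len(indexed)
    groups = defaultdict(list)
    for i, value in indexed:
        # "flatMap": element i belongs to the windows starting at i-n+1 .. i,
        # clipped to the valid window starts 0 .. size-n.
        for start in range(max(0, i - n + 1), min(i, size - n) + 1):
            groups[start].append((i, value))
    # "groupByKey": restore the order inside each window using the index.
    return [[v for _, v in sorted(groups[s])] for s in sorted(groups)]


# The pairs arrive in arbitrary order, as they would in a distributed dataset.
pairs = [(3, 4), (0, 1), (5, 6), (2, 3), (1, 2), (4, 5)]
windows = sliding_windows(pairs, 2)
print(windows)  # [[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]
```

Flattening the windows gives [1, 2, 2, 3, 3, 4, 4, 5, 5, 6], the dataset2 from the question.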

Trying to do an upsert with a $push in the most efficient manner using pymongo

I have a search query. If it finds a match, I would like to push onto the matching record's 'vals' array; if it does not find a match, I would like to insert a new record built from the search query along with the newVals array.
findDict = {'a': 100, 'b': 250, 'c': 110}
newVals = [{'x': 1, 'y': 2}, {'x': 4, 'y': 7}]
collection.update(findDict, {'$push': {'vals': newVals}}, upsert=True)
In this example above, if a match was found for findDict, then newVals would be pushed onto the existing vals array for the matching record.
If no match is found, I would like it to create a new record that looks like this:
{a: 100, b: 250, c: 110, vals: [{x: 1, y: 2}, {x: 4, y: 7}]}
I have to do this several million times, so I'm hoping to do it in the most optimal way. I also have many threads coming in and doing this at once so have to worry about concurrency. The update statement posted above almost seems to work, but it creates an entry like this for some reason if no match is found:
{a: 100, b: 250, c: 110, vals: [ [ {x: 1, y: 2}, {x: 4, y: 7} ] ]}
Note the array inside the array.
I currently have a unique compound index on a, b, and c. This can be changed if it will help somehow. I think I could do an update with upsert set to False, followed by an insert that fails if the unique key already exists, but it seems I would be doing each search twice in that case and killing my efficiency.
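Have you tried using $push with $each?
# update_one replaces the legacy update(), which PyMongo 4 removed.
collection.update_one(
    findDict,
    # $each appends every element of newVals instead of the list itself.
    {'$push': {'vals': {'$each': newVals}}},
    upsert=True
)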
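To see why $each removes the nested array, here is a toy in-memory emulation of the two forms of $push. It is not PyMongo or server code; the function apply_push is made up purely to illustrate the semantics.

```python
def apply_push(document, field, operand):
    """Toy emulation of $push on a plain dict (illustration only)."""
    target = document.setdefault(field, [])  # an upsert starts from an empty array
    if isinstance(operand, dict) and "$each" in operand:
        # With $each, every element of the list is appended individually.
        target.extend(operand["$each"])
    else:
        # Without $each, the operand is appended as one value, even if it is a list.
        target.append(operand)
    return document


new_vals = [{"x": 1, "y": 2}, {"x": 4, "y": 7}]

nested = apply_push({"a": 100}, "vals", new_vals)
flat = apply_push({"a": 100}, "vals", {"$each": new_vals})

print(nested["vals"])  # [[{'x': 1, 'y': 2}, {'x': 4, 'y': 7}]]
print(flat["vals"])    # [{'x': 1, 'y': 2}, {'x': 4, 'y': 7}]
```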

Mongo indexes in detail

I have a MongoDB collection whose documents look like the following:
collection {
X: 1,
Y: 2,
Z: 3,
T_update: 123,
T_publish: 243,
T_insert: 342
}
I have to create indexes like
{X: 1, Y: 1, Z: 1, T_update: 1}
{X: 1, Y: 1, Z: 1, T_publish: 1}
{X: 1, Y: 1, Z: 1, T_insert: 1}
But the prefix X: 1, Y: 1, Z: 1 is repeated in all three indexes, and only the time parameter, which I intend to use for sorting, changes. Is there a better way to create these indexes, so that I do not have to create three separate ones?
Also, say I have an index like
{X: 1, Y: 1, Z: 1, T_update: 1}
and I want Mongo to return results where X = 5, Y = any value, Z = 4, sorted by T_update.
Will the above index be useful, or should I create an index such as
{X: 1, Z: 1, T_update: 1}?
I hope that I can avoid it.
The answer here is going to depend on the selectivity of the fields you are indexing - if the criteria you will be using to filter X, Y, or Z are not very selective then they can essentially be left out (or moved to the right of the compound key).
Let's say you are using a filter like "Y is not equal to 1", where 1 is a rare value. Since you will be traversing almost the entire index to return most of the values, and scanning the data as well, an index on Y will be of less benefit than an index that supports the sort first. In that scenario, if you are sorting on T_update, it would probably be beneficial to have an index like {T_update: 1, Y: 1}.
In the end, there are lots and lots of permutations here in terms of what might be the most efficient way to index. The real way to figure out the best indexes for your data set is to use explain() and hint() to test the various indexes with your specific query pattern and data set.