How to make indexes in MongoDB when there are a huge number of filters?


There are over 30 different filters.
Each can be an exact value, a selection from a set of values, or a range of values.
Various combinations are possible, i.e. there are several hundred possible combinations of filters.
How, then, should the indexes be composed correctly?
Does it make sense to make an index on each field?
Count is extremely slow in this case.

Related

Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to the other and I would like to link these two together.
There is an index in both data frames, though this index indicates the locality of the survey (i.e. region) and not individual IDs.
Merging is not possible because in most cases different participants share identical index values.
Since merging on the index value is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, to find the most likely match for each row. I can then (with some margin of error) match the rows with the most similar values for those variables and merge them together.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James

That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.
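As a rough illustration of what an approximate join could look like here, the sketch below pairs rows that share a region and differ in at most a few binary answers. Everything in it is an assumption made for the example: the `(region, answers)` row layout, the function names, the Hamming distance as the similarity measure, and the greedy one-to-one pairing. It is a minimal sketch, not a recommended record-linkage method.

```python
def hamming(a, b):
    """Number of positions at which two equal-length answer tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def approximate_join(left, right, max_distance=1):
    """Greedily pair rows from two surveys.

    left, right: lists of (region, answers) where answers is a tuple of 0/1.
    Only rows in the same region are compared (the shared, non-unique index).
    Returns a list of (left_position, right_position, distance).
    """
    candidates = []
    for i, (region_l, answers_l) in enumerate(left):
        for j, (region_r, answers_r) in enumerate(right):
            if region_l != region_r:
                continue  # different locality, cannot be the same participant
            d = hamming(answers_l, answers_r)
            if d <= max_distance:
                candidates.append((d, i, j))

    # Take the closest pairs first and use every row at most once.
    candidates.sort()
    used_left, used_right, matches = set(), set(), []
    for d, i, j in candidates:
        if i not in used_left and j not in used_right:
            matches.append((i, j, d))
            used_left.add(i)
            used_right.add(j)
    return matches
```

The `max_distance` threshold is where the "margin of error" comes in: raising it finds more matches but also more false ones.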

plans and subset of rows in pyfftw

I want to use pyfftw to repeatedly compute the discrete Fourier transform of a subset of rows of a two-dimensional array. I do not know in advance which rows I need to transform; that depends on the output from the previous round. I do know that doing it for all rows is wasteful.
It is my understanding that a 'plan' in FFTW3 is associated with the type of transform (c2c, r2c, etc) and the input/output length, which is always a vector in the 1D case. In pyfftw it looks like a 'plan' is associated to the type of transform and the input/output shape, so my interpretation is that it uses the same FFTW3 plan for every row.
My question is: is it possible to use the same FFTW3 plan for some of the rows, without creating separate pyfftw.FFTW objects for all possible combinations of rows?
On a different note, I am wondering how pyfftw uses multiple cores: does it use multiple cores for each row (this appears natural in view of FFTW3 documentation) or does it farm out different rows to different cores (which was my initial assumption)?

If you can create a numpy array from a view, you can plan for it with pyFFTW - all valid numpy arrays should work just fine.
This means several things:
Your array needs to have regular strides, but those strides can be arbitrary.
ND arrays are planned as ND transforms, with the selected axes being used.
You can probably do something cunning with stride tricks and it will probably work (but might not do what you expect if you do something too nefarious like overlapping rows and then use threads).
One solution that I've used quite a bit is to copy the rows that you want to transform into an interim array, and transform that. You might well find that's the fastest option (particularly when you can allow for getting the byte offset correct).
Obviously, this doesn't work directly if you always have a different number of rows. You might still find that planning for the largest number of rows you will ever transform, and then copying in a subset, is faster than the alternatives.
The problem you're going to come up against, even if you go down to the C level, is that the planning overhead might well dominate if you're changing your transform sizes often.
You could also try pyfftw.interfaces.numpy_fft, which is normally faster than numpy and has the ability to cache repeated transform sizes.
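The copy-into-an-interim-array idea can be sketched as follows. To keep the example runnable without pyFFTW installed, it uses `numpy.fft.fft` as a stand-in; the function name `fft_selected_rows` and the buffer layout are assumptions made for this illustration, not part of pyFFTW.

```python
import numpy as np


def fft_selected_rows(data, rows, buffer):
    """Transform only the chosen rows of `data` via a fixed-size buffer.

    data:   2-D array, one signal per row.
    rows:   list of row indices to transform (at most buffer.shape[0]).
    buffer: complex array of shape (max_rows, data.shape[1]), reused
            between calls so the transform shape never changes.
    """
    n = len(rows)
    buffer[:n] = data[rows]  # fancy indexing copies the selected rows
    buffer[n:] = 0           # unused slots are transformed but ignored
    # Always the same shape, so a cached or pre-built plan could be reused.
    return np.fft.fft(buffer, axis=1)[:n]
```

Because the buffer shape is constant, the same approach should carry over to a single planned transform, or to `pyfftw.interfaces.numpy_fft.fft` with its cache enabled, at the cost of transforming some unused rows.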

Selecting the proper db index

I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields: "layer" (integer), "time", and "cnt". Many records share the same values for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some will be duplicates, and then they will be grouped to account for those duplicates. "cnt" is just used to count.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time), and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...

Your where clause appears to be:
WHERE layer = x and time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.
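To see the effect of a composite (layer, time) index with cnt included, here is a small self-contained sketch. It deliberately swaps Postgres for an in-memory SQLite database so it can run anywhere; the table name `readings`, the index name and the generated data are invented for the example, and SQLite's planner is not Postgres's, so treat the output only as an illustration of the principle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE readings (layer INTEGER, "time" INTEGER, cnt INTEGER)')
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [(t % 6, t, 1) for t in range(10_000)],  # made-up data: layers 0..5
)

# Equality column first, range column second, cnt last so the index
# alone can answer the query (a covering index).
conn.execute('CREATE INDEX idx_layer_time_cnt ON readings (layer, "time", cnt)')

query = """
    SELECT "time", SUM(cnt)
    FROM readings
    WHERE layer = ? AND "time" >= ? AND "time" <= ?
    GROUP BY "time"
"""
params = (2, 100, 200)

# The last column of each EXPLAIN QUERY PLAN row is the readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(row[-1] for row in plan)
rows = conn.execute(query, params).fetchall()

print(plan_text)
print(len(rows), "groups")
```

In Postgres the equivalent check would be running the query under EXPLAIN and looking for an index scan on the (layer, time) index.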

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down from 100 million to 5000. Then, using those 5000 geometry GUIDs, I want to filter the 3.5 trillion documents based on the 5000 geometries and additional date criteria I specify, aggregate the data, and find the average. You are left with 5000 geometries and 5000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
Clarify: as I understand it, this is what DBRefs are used for, but I read that they are not efficient at all, and that with this much data they wouldn't be a good fit.

If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15 minute increments isn't killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you sparse things up for missing data. You can encode the data differently if it's sparse rather than indexing into a 35040 slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue though. Make sure you have a geospatial index (like 2dsphere) on the geometry to speed things up.
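The 35040 figure is 365 days times 96 fifteen-minute slots per day. A minimal sketch of how a timestamp could map to a slot, and how a sparse encoding might be averaged, is below. The function names and the `{slot: value}` dictionary layout are assumptions for illustration only, and the arithmetic assumes a non-leap year.

```python
from datetime import datetime

SLOTS_PER_DAY = 24 * 4  # 96 fifteen-minute slots per day


def slot_index(ts):
    """Position of a timestamp in a per-year array of 15-minute slots."""
    day = ts.timetuple().tm_yday - 1  # 0-based day of the year
    return day * SLOTS_PER_DAY + (ts.hour * 60 + ts.minute) // 15


def average_between(sparse_series, start, end):
    """Average the values present between two timestamps (inclusive).

    sparse_series: dict mapping slot index -> value; missing slots are
    simply absent, which is one way to store sparse data compactly.
    Returns None when no value falls in the range.
    """
    lo, hi = slot_index(start), slot_index(end)
    values = [v for slot, v in sparse_series.items() if lo <= slot <= hi]
    return sum(values) / len(values) if values else None
```

With a dense series, the same `slot_index` would simply be used as an offset into the 35040-element array stored on the geometry document.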
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

PIG - filtering groupby by the contents of the group

I am new to pig, and I am wondering if I can do any inter-group filtering easily with it.
I have some data grouped by userid and some timestamps. I want to take only the groups that have two consecutive timestamps that are less than 30 minutes apart. Is this easy to express in Pig?
Thanks a lot!

The cleanest way to do this would be to write a UDF. The function would take a bag of timestamps as input, order them, and compute the minimum difference between timestamps. You could then filter your data based on the output of this UDF.
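The core logic of such a UDF might look like the sketch below. It is plain Python and leaves out how the function would be registered with Pig; the function names are invented for the example, and it assumes timestamps arrive as numbers of seconds (for example, Unix epoch seconds).

```python
def min_consecutive_gap(timestamps):
    """Smallest difference between neighbouring timestamps once sorted.

    Returns None when there are fewer than two timestamps.
    """
    ordered = sorted(timestamps)
    if len(ordered) < 2:
        return None
    return min(later - earlier for earlier, later in zip(ordered, ordered[1:]))


def has_close_pair(timestamps, threshold_seconds=30 * 60):
    """True if any two consecutive timestamps are under the threshold apart."""
    gap = min_consecutive_gap(timestamps)
    return gap is not None and gap < threshold_seconds
```

Sorting first means only neighbouring pairs need to be compared, which avoids the cross-product that the pure Pig Latin route requires.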
It is possible to do this in pure Pig Latin, if you really want to, although it involves more temporary data and map-reduce cycles, which means it may not be worth it. This would involve FLATTENing the bag of timestamps twice to get its cross-product, creating an indicator variable for any pairs of timestamps separated by less than 30 minutes, and then summing this variable for each user. Any user with a sum greater than zero has the property you desire.
Give it a go, and if you run into any specific issues, post another question outlining exactly where you're stuck.