MongoDB aggregation or MapReduce pass down value to each document

I am working on an IoT project that involves fuel volume information.
I am iterating over each MongoDB document and trying to plot the fuel volumes in a graph while omitting noise values. Each document contains the fuel value for one second.
When I plot the data as it is, there is a lot of noise, such as:
When the vehicle is idle, the same fuel data point is plotted over and over in the graph.
When the vehicle is moving on a rough road, the readings spike.
The data collection looks like this:
{
  _id: 1,
  fuel: 100,
  timestamp: 2020-09-18T06:06:01.628+00:00
},
{
  _id: 2,
  fuel: 100.1,
  timestamp: 2020-09-18T06:06:02.628+00:00
},
{
  _id: 3,
  fuel: 98,
  timestamp: 2020-09-18T06:06:03.628+00:00
}
I am trying to compare each document's fuel value with the previous document's fuel value and compute the percentage change. Then I can apply a threshold, such as plus or minus 5%, and filter out certain data points. That would solve the first problem.
I don't have any idea how to avoid the spiky data points.
I tried the aggregation pipeline, but I couldn't find a suitable operator or a way to save and compare the previous document's value. Is there any way to do this? I cannot smooth the data in memory because the data array can be very large at times.
I highly appreciate your support.

map/reduce fans out the input documents to processors; the documents are not necessarily processed sequentially, therefore it doesn't make sense to talk about a "previous" document.
With the aggregation pipeline you can use $group or $bucket to create buckets of data and then operate on the buckets (e.g. to average the values within a bucket).
If you want to do complex math on the values, you may need to do that in the application in any event.
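As a minimal sketch of that bucketing approach, assuming a collection named fuelReadings shaped like the documents in the question (the collection name and the one-minute bucket size are my assumptions):

db.fuelReadings.aggregate([
  {
    // Group the per-second readings into one-minute buckets by truncating the timestamp.
    $group: {
      _id: {
        year:   { $year: "$timestamp" },
        month:  { $month: "$timestamp" },
        day:    { $dayOfMonth: "$timestamp" },
        hour:   { $hour: "$timestamp" },
        minute: { $minute: "$timestamp" }
      },
      avgFuel: { $avg: "$fuel" },   // averaging smooths idle duplicates and short spikes
      count:   { $sum: 1 }
    }
  },
  { $sort: { _id: 1 } }
])

Each output document then becomes one point on the graph, which also keeps the result set small enough to hand back to the application for any further math.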

Related

MongoDB Compass shows bad minimum value of data distribution of a key

I'm on MongoDB Compass Version 1.5.1 for Mac.
When I look at the distribution of values, Compass returns plots like the following:
As you can see, min and max values are shown, but the min values are wrong. I know the minimum values of those two keys are 1 and 1, not 9 and 13.
Does anyone know how to fix that problem?
Got it. The standard report is based on a sample of at most 1000 documents.
From the docs:
Sampling in MongoDB Compass is the practice of selecting a subset of data from the desired collection and analyzing the documents within the sample set.
Sampling is commonly used in statistical analysis because analyzing a subset of data gives similar results to analyzing all of the data. In addition, sampling allows results to be generated quickly rather than performing a potentially long and computationally expensive collection scan.
MongoDB Compass employs two distinct sampling mechanisms.
Collections in MongoDB 3.2 are sampled via the $sample operator in the aggregation framework of the core server. This provides efficient random sampling without replacement over the entire collection, or over the subset of documents specified by a query.
Collections in MongoDB 3.0 and 2.6 are sampled via a backwards compatible algorithm executed entirely within Compass. It comprises three phases:
1. Query for a stream of _id values, limit 10000, descending by _id.
2. Read the stream of _ids and save sampleSize randomly chosen values. We employ reservoir sampling to perform this efficiently.
3. Query the selected random documents by _id.
The choice of sampling method is transparent in usage to the end-user.
sampleSize is currently set to 1000 documents.
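To double-check the true minimum outside of Compass's 1000-document sample, you can run an aggregation over the whole collection in the shell; the collection and field names below are placeholders for your own:

db.myCollection.aggregate([
  // Scans every document rather than a sample, so the result is exact.
  { $group: { _id: null, minValue: { $min: "$myKey" }, maxValue: { $max: "$myKey" } } }
])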

mongodb integer index precision

TLDR: Given documents such as { x: 15, y: 15 }, can I create an index on 20 / x, 20 / y without adding those values as fields?
I'd like to index the x, y coordinates of a collection of documents. However my use case is such that:
Each document will have a unique x,y pair, so you can think of this as an ID.
Documents are added in blocks of 20x20 and are always fully populated (every x,y combo in that block will exist)
Lookups will only ever be against blocks of 20x20 and will always align on those boundaries. I'll never query for part of a block or for values that aren't multiples of 20.
For any possible block, there will either be no data or 4,000 results. A block is never sparsely populated.
I will be writing much more frequently than reading so write efficiency is very important.
An index of { x: 1, y: 1} would work but seems wasteful. With my use case the index would have an entry for every document in the collection! Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Is there a way to create an index like this or am I thinking in the wrong direction?
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Index entries are created per-document, since each index entry points to a single document. Lowering the "resolution" of the index would therefore impart no space savings at all, since the size of the index depends on the index type (single field, compound, etc. see https://docs.mongodb.com/manual/indexes/#index-types) and the number of documents in that collection.
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
If the fields block_x and block_y would help you to more effectively find a document, then I wouldn't say it's junk data. It's true that you don't need to display those fields in your application, but they could be useful to speed up your queries nonetheless.
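As a hedged sketch of that suggestion, deriving the block fields at write time (the points collection name and the Math.floor rounding are my assumptions):

// Store the block's origin alongside the raw coordinates when inserting.
db.points.insertOne({
  x: 15,
  y: 15,
  block_x: Math.floor(15 / 20) * 20,   // 0 -- the 20x20 block's lower-left corner
  block_y: Math.floor(15 / 20) * 20
})

// One index entry per document is still created, but block-aligned lookups
// become simple equality matches on the two derived fields.
db.points.createIndex({ block_x: 1, block_y: 1 })

// Fetch the entire block whose origin is (0, 0):
db.points.find({ block_x: 0, block_y: 0 })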

MongoDB and using DBRef with Spatial Data

I have a collection with 100 million documents of geometry.
I have a second collection with time data associated with each of those geometries. This will be 365 * 96 * 100 million, or 3.5 trillion, documents.
Rather than storing the 100 million geometry entries 365*96 times more often than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter it down from 100 million to 5,000. Then, using those 5,000 geometry GUIDs, I want to filter the 3.5 trillion documents by those geometries and the additional date criteria I specify, aggregate the data, and find the average. You are left with 5,000 geometries and 5,000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, say in less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that with this much data they wouldn't be a good fit.
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same document. A year's worth of data in 15-minute increments isn't a killer - and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you keep things sparse for missing data: you can encode the data differently if it's sparse, rather than indexing into a 35,040-slot array.
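An illustrative document shape for that combined layout (the field names and the sparse-map alternative are assumptions, not a prescription):

{
  _id: "geometry-guid",
  loc: { type: "Polygon", coordinates: [ /* GeoJSON rings */ ] },
  state: "TX",                        // cheap qualifier, as discussed below
  // either a dense 35,040-slot array (365 days * 96 slots per day) ...
  readings: [ 1.2, 1.3 /* , ... */ ]
  // ... or, when data is sparse, a map keyed by slot number:
  // readings: { "0": 1.2, "96": 1.4 }
}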
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have a geospatial index (like 2dsphere) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
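A minimal sketch of that combination, assuming a geometries collection with a GeoJSON loc field and a state qualifier (all names illustrative):

// Geospatial index plus a plain index on the cheap qualifier.
db.geometries.createIndex({ loc: "2dsphere" })
db.geometries.createIndex({ state: 1 })

// Pre-qualify on the inexpensive field before the costly geo predicate.
db.geometries.find({
  state: { $in: ["TX", "OK"] },
  loc: {
    $geoIntersects: {
      $geometry: {
        type: "Polygon",
        coordinates: [[ [-97, 30], [-96, 30], [-96, 31], [-97, 31], [-97, 30] ]]
      }
    }
  }
}, { _id: 1 })   // only the GUIDs are needed for the second stage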
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.

Database solution to store and aggregate vectors?

I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag for each of these vectors. For example, if we want to see the vectors by sector, we want to get back all the unique sectors, each with a 3000x1 vector that is the element-wise sum of all the vectors tagged with that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables: one with the tagging information and an ID, and a second table with "VectorDate, ID, ElementNumber, Value", which has a row for each element of each vector. Unfortunately, given the size of the data, that means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) and have seen that some, like MongoDB, allow storing entire vectors as part of a single document - but I'm unsure whether they would allow the aggregation we're trying to do efficiently (adding each element of a document's vector to the corresponding element of other documents' vectors). I've also read that the $unwind operation this would require isn't that efficient either.
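For illustration, the kind of MongoDB pipeline I have in mind would be something like the following (the collection and field names are made up):

db.vectors.aggregate([
  { $match: { vectorDate: ISODate("2014-01-01") } },
  // Explode each 3000-element vector while keeping the element position.
  { $unwind: { path: "$vector", includeArrayIndex: "elementNumber" } },
  // Element-wise sum across all vectors sharing the same sector.
  { $group: {
      _id: { sector: "$sector", elementNumber: "$elementNumber" },
      total: { $sum: "$vector" }
  } },
  { $sort: { "_id.sector": 1, "_id.elementNumber": 1 } }
])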
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!

Is MongoDB geohaystack the same as the standard MongoDB spatial index?

It seems that MongoDB has two types of geospatial indexes.
http://www.mongodb.org/display/DOCS/Geospatial+Indexing
The standard one. With a note:
You may only have 1 geospatial index per collection, for now. While MongoDB may allow to create multiple indexes, this behavior is unsupported. Because MongoDB can only use one index to support a single query, in most cases, having multiple geo indexes will produce undesirable behavior.
And then there is this so-called geohaystack thingy.
http://www.mongodb.org/display/DOCS/Geospatial+Haystack+Indexing
They both claim to use the same algorithm: they both divide the earth into grids and then search based on that.
So what's the difference?
MongoDB doesn't seem to use R-trees and the like, right?
NB: The answer to the question How does MongoDB implement its spatial indexes? says that the 2d index uses a geohash too.
The implementation is similar, but the use case difference is described on the Geospatial Haystack Indexing page.
The haystack indices are "bucket-based" (aka "quadrant") indexes tuned for small-region longitude/latitude searches:
In addition to ordinary 2d geospatial indices, MongoDB supports the use of bucket-based geospatial indexes. Called "Haystack indexing", these indices can accelerate small-region type longitude / latitude queries when additional criteria is also required.
For example, "find all restaurants within 25 miles with name 'foo'".
Haystack indices allow you to tune your bucket size to the distribution of your data, so that in general you search only very small regions of 2d space for a particular kind of document. They are not suited for finding the closest documents to a particular location, when the closest documents are far away compared to bucket size.
The bucketSize parameter is required, and determines the granularity of the haystack index.
So, for example:
db.places.ensureIndex({ pos : "geoHaystack", type : 1 }, { bucketSize : 1 })
This example bucketSize of 1 creates an index where keys within 1 unit of longitude or latitude are stored in the same bucket. An additional category can also be included in the index, which means that information will be looked up at the same time as finding the location details.
The B-tree representation would be similar to:
{ loc: "x,y", category: z }
If your use case typically searches for "nearby" locations (i.e. "restaurants within 25 miles"), a haystack index can be more efficient. The matches for the additional indexed field (e.g. category) can be found and counted within each bucket.
If, instead, you are searching for "nearest restaurant" and would like to return results regardless of distance, a normal 2d index will be more efficient.
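As a sketch of how a haystack index is queried via the geoSearch command, using the places collection and type field from the example index above (the coordinates, distance, and limit are illustrative):

db.runCommand({
  geoSearch: "places",              // collection carrying the geoHaystack index
  near: [ -73.9667, 40.78 ],        // longitude, latitude
  maxDistance: 6,                   // in the same units as the stored coordinates
  search: { type: "restaurant" },   // equality match on the additional indexed field
  limit: 30
})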
There are currently (as of MongoDB 2.2.0) a few limitations on haystack indexes:
only one additional field can be included in the haystack index
the additional index field has to be a single value, not an array
null long/lat values are not supported
Note: the distance spanned by a degree of longitude varies greatly with latitude (a degree of latitude varies much less). See: What is the distance between a degree of latitude and longitude?