TL;DR: Given documents such as { x: 15, y: 15 }, can I create an index on x / 20, y / 20 (i.e. the block-aligned values) without adding those values as fields?
I'd like to index the x, y coordinates of a collection of documents. However my use case is such that:
Each document will have a unique x,y pair, so you can think of this as an ID.
Documents are added in blocks of 20x20 and are always fully populated (every x,y combo in that block will exist)
Lookups will only ever be against blocks of 20x20 and will always align on those boundaries. I'll never query for part of a block or for values that aren't multiples of 20.
For any possible block, there will either be no data or 4,000 results. A block is never sparsely populated.
I will be writing much more frequently than reading so write efficiency is very important.
An index of { x: 1, y: 1} would work but seems wasteful. With my use case the index would have an entry for every document in the collection! Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Is there a way to create an index like this or am I thinking in the wrong direction?
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
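To make the access pattern concrete, here is roughly what a block lookup looks like today with the plain { x: 1, y: 1 } compound index (a minimal pymongo sketch; the database and collection names are placeholders):

```python
from pymongo import ASCENDING, MongoClient

coll = MongoClient().mydb.points  # hypothetical database/collection names

# The straightforward compound index the question considers wasteful:
coll.create_index([("x", ASCENDING), ("y", ASCENDING)])

# Fetch one 20x20 block aligned on multiples of 20, e.g. the block starting at (40, 80):
block_x, block_y = 40, 80
block = list(coll.find({
    "x": {"$gte": block_x, "$lt": block_x + 20},
    "y": {"$gte": block_y, "$lt": block_y + 20},
}))
```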
Since my queries will be aligned on multiples of 20, I really only need to index to that resolution. I'm expecting this would produce a smaller footprint on disk, in memory, and slightly faster lookups.
Index entries are created per document, since each index entry points to a single document. Lowering the "resolution" of the indexed values therefore yields no space savings: the size of the index depends on the index type (single field, compound, etc.; see https://docs.mongodb.com/manual/indexes/#index-types) and the number of documents in the collection, and that document count does not change.
I'm aware that I could add block_x and block_y to my document, but the block concept doesn't exist in my application so it would be junk data.
If the fields block_x and block_y would help you to more effectively find a document, then I wouldn't say it's junk data. It's true that you don't need to display those fields in your application, but they could be useful to speed up your queries nonetheless.
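For example, a minimal sketch of that approach (pymongo; the field, database, and collection names are assumptions), deriving the block values once at write time:

```python
from pymongo import ASCENDING, MongoClient

coll = MongoClient().mydb.points  # hypothetical database/collection names

# Index only the block-aligned fields; x and y themselves stay unindexed.
coll.create_index([("block_x", ASCENDING), ("block_y", ASCENDING)])

def insert_point(doc):
    """Derive the block coordinates once, at write time."""
    doc["block_x"] = doc["x"] - doc["x"] % 20
    doc["block_y"] = doc["y"] - doc["y"] % 20
    coll.insert_one(doc)

insert_point({"x": 15, "y": 15})

# A block lookup then becomes a simple equality match on the indexed fields:
block = list(coll.find({"block_x": 0, "block_y": 0}))
```

As noted above, this index still has one entry per document; the practical difference is that a block lookup becomes an equality match on both keys instead of a pair of range predicates.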
Related
When a PostgreSQL query's execution plan is generated, how does an index's fill factor affect whether the index gets used in favor of a sequential scan?
A fellow dev and I were reviewing the performance of a PostgreSQL (12.4) query with a window function, row_number() OVER (PARTITION BY x, y, z), and seeing if we could speed it up with an index on those fields. We found that the index would get used if we created it with a fill factor >= 80 but not at 75. This was a surprise to us, as we did not expect the fill factor to be considered when creating the query plan.
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used. What causes this behavior and should we consider it when selecting an index's fill factor on a table that will have frequent inserts and deletes and be periodically vacuumed?
If we create the index at 75 and then insert rows, thereby packing the pages > 75, then once again the index gets used.
So, it is not the fill factor itself, but rather the size of the index (which is influenced by the fill factor). This agrees with my memory that index size is a (fairly weak) influence on the cost estimate. That influence is almost zero if you are reading only one tuple, but larger if you are reading many tuples.
If the cost estimates of the competing plans are close to each other, then small differences such as this will be enough to tip the planner from one to the other. But that doesn't mean you should worry about them. If one plan is clearly superior to the other, then you should think about why the estimates are so close together to start with when the realities are not.
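One way to see this effect directly is to rebuild the index at a few fill factors, compare the resulting sizes, and look at the plan each time. A psycopg2 sketch; the table name, index name, and connection string are assumptions, not from the original question:

```python
import psycopg2

conn = psycopg2.connect("dbname=test")  # assumed connection string
conn.autocommit = True
cur = conn.cursor()

# Hypothetical table t(x, y, z); fillfactor is a btree storage parameter.
for ff in (75, 80, 90):
    cur.execute("DROP INDEX IF EXISTS idx_t_xyz")
    cur.execute(f"CREATE INDEX idx_t_xyz ON t (x, y, z) WITH (fillfactor = {ff})")
    cur.execute("SELECT pg_size_pretty(pg_relation_size('idx_t_xyz'))")
    print(f"fillfactor={ff}: index size {cur.fetchone()[0]}")

    # EXPLAIN shows which plan the smaller or larger index tips the planner toward.
    cur.execute(
        "EXPLAIN SELECT x, y, z, row_number() OVER (PARTITION BY x, y, z) FROM t"
    )
    for (line,) in cur.fetchall():
        print("   ", line)
```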
I have a set of IDs which are numbers anywhere between 8 and 11 digits long, and there are only 300K of them (so not exactly sequential etc.). These are stored in collection A.
I have a collection B with millions of entries in which every entry has an array of these IDs, and every array could have thousands of these IDs. I need to index this field too (i.e. hundreds of millions, potentially up to a billion+ entries). When I indexed it, the index turned out to be very large, way over the RAM size of the cluster.
Would it be worth trying to compress each ID down from 8-11 digits into some shorter alphanumeric encoded string? Or simply re-number them sequentially from e.g. 1 to 300,000 (and maintain a mapping)? Would that have a significant impact on the index size, or is it not worth the effort?
The size of your indexed field affects the size of the index. You can run a collStats command to check the size of the index and compare the size of your indexed field with the total size that MongoDB needs for the index.
MongoDB already performs some compression on indexes, so encoding your field as an alphanumeric string is probably going to give no benefit, or only a marginal one.
Using a smaller numeric type will save a small amount of space in your index, but if you need to maintain a mapping, it's probably not worth the effort and is probably overcomplicating things.
The size of the index for a collection with 300K elements, indexing only the 11-digit IDs, should be small (around a few MB), so it's very likely that you don't have storage or memory issues with that index.
Regarding your second collection, shaving some bytes off each ID will give you some reduction in index size. For example, reducing each ID from 8 bytes to 4 bytes across roughly 1 billion entries saves a few GB of index.
Cutting a few GB from the index and from collection B could be a worthwhile saving, so depending on your needs it may be worth the effort to store the IDs in the smallest type possible. However, you could still run into memory issues now, or in the near future as the collections keep growing, if the index does not fit in memory, so sharding the collection could be a good option.
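A minimal sketch of the collStats check mentioned above, via pymongo (database and collection names are placeholders):

```python
from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical database name

for name in ("collection_a", "collection_b"):  # hypothetical collection names
    stats = db.command("collStats", name)
    print(name, "documents:", stats["count"])
    print(name, "total index size (bytes):", stats["totalIndexSize"])
    # Per-index breakdown, so the ID index can be compared on its own:
    for index_name, size in stats["indexSizes"].items():
        print("   ", index_name, size, "bytes")
```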
If you are mainly interested in saving index size, you could also create a hashed index, which will give you more or less the same performance for equality lookups.
Check with some of your data what percentage of index size you save and what the performance impact is, and make the decision based on that.
I have a table with 10+ million tuples in my Postgres database that I will be querying. There are 3 fields: "layer" (integer), "time", and "cnt". Many records share the same value for "layer" (distributed from 0 to about 5 or so, heavily concentrated between 0-2). "time" has relatively unique values, but during queries the values will be manipulated such that some become duplicates, and the results are then grouped by "time" to account for those duplicates. "cnt" is just used to count.
I am trying to query records from certain layers (WHERE layer = x) between certain times (WHERE time <= y AND time >= z), and I will be using "time" as my GROUP BY field. I currently have 4 indexes, one each on (time), (layer), (time, layer), and (layer, time), and I believe this is too many (I copied this from a template provided by my supervisor).
From what I have read online, fields with relatively unique values, as well as fields that are frequently-searched, are good candidates for indexing. I have also seen that having too many indexes will hinder the performance of my query, which is why I know I need to drop some.
This leads me to believe that the best index choice would be on (time, layer) (I assume a btree is fine because I have not seen reason to use anything else), because while I query slightly more frequently on layer, time better fits the criterion of having more relatively unique values. Or, should I just have 2 indices, 1 on layer and 1 on time?
Also, is an index on (time, layer) any different from (layer, time)? Because that is one of the confusions that led me to have so many indices. The provided template has multiple indices with the same 3 attributes, just arranged in different orders...
Your where clause appears to be:
WHERE layer = x AND time <= y AND time >= z
For this query, you want an index on (layer, time). You could include cnt in the index so the index covers the query -- that is, all data columns are in the index so the original data pages don't need to be accessed for the data (they may be needed for locking information).
Your original four indexes are redundant, because the single-column indexes are not needed. The advice to create all four is not good advice. However, (layer, time) and (time, layer) are different indexes and under some circumstances, it is a good idea to have both.
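A sketch of that covering index and the query it serves (psycopg2; the table name measurements, the assumption that time is a timestamp-like column, and PostgreSQL 11+ for INCLUDE are all assumptions, not from the question):

```python
import psycopg2

conn = psycopg2.connect("dbname=test")  # assumed connection string
cur = conn.cursor()

# Covering index: (layer, time) are the key columns used by the WHERE clause,
# and cnt is carried along so the query can often be answered from the index
# alone (an index-only scan, subject to visibility checks).
cur.execute(
    "CREATE INDEX IF NOT EXISTS idx_layer_time ON measurements (layer, time) INCLUDE (cnt)"
)
conn.commit()

# The query shape from the question, grouped by time:
cur.execute(
    """
    SELECT time, SUM(cnt)
    FROM measurements
    WHERE layer = %s AND time <= %s AND time >= %s
    GROUP BY time
    """,
    (1, "2020-12-31", "2020-01-01"),
)
rows = cur.fetchall()
```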
I have a collection with 100 million documents of geometry.
I have a second collection with time data associated to each of the other geometries. This will be 365 * 96 * 100 million or 3.5 trillion documents.
Rather than store the 100 million entries (365*96) times more than needed, I want to keep them in separate collections and do a type of JOIN/DBRef/Whatever I can in MongoDB.
First and foremost, I want to get a list of GUIDs from the geometry collection by using a geoIntersection. This will filter the 100 million down to about 5000. Then, using those 5000 geometry GUIDs, I want to filter the 3.5 trillion documents by those 5000 geometries plus additional date criteria I specify, aggregate the data, and find the average. You are left with 5000 geometries and 5000 averages for the date criteria you specified.
This is basically a JOIN as I know it in SQL. Is this possible in MongoDB, and can it be done optimally, in say less than 10 seconds?
To clarify: as I understand it, this is what DBRefs are used for, but I have read that they are not efficient at all, and that when dealing with this much data they wouldn't be a good fit.
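For reference, MongoDB's closest analogue to a SQL JOIN is the $lookup aggregation stage (available since 3.2). A rough sketch of the shape described above; the field names, collection names, dates, and polygon are assumptions, and whether a $lookup against a collection this large performs acceptably is exactly the open question:

```python
from datetime import datetime
from pymongo import MongoClient

db = MongoClient().mydb  # hypothetical database name

search_area = {  # hypothetical GeoJSON polygon for the geoIntersection step
    "type": "Polygon",
    "coordinates": [[[-74.1, 40.6], [-73.8, 40.6], [-73.8, 40.9],
                     [-74.1, 40.9], [-74.1, 40.6]]],
}

pipeline = [
    # 1. Filter the 100M geometries down to the ~5000 that intersect the area.
    {"$match": {"geometry": {"$geoIntersects": {"$geometry": search_area}}}},
    # 2. Join each geometry's GUID against the time-data collection.
    {"$lookup": {
        "from": "timedata",             # hypothetical collection name
        "localField": "guid",
        "foreignField": "geometry_guid",
        "as": "samples",
    }},
    # 3. Apply the date criteria and average per geometry.
    {"$unwind": "$samples"},
    {"$match": {"samples.ts": {"$gte": datetime(2019, 1, 1),
                               "$lt": datetime(2019, 2, 1)}}},
    {"$group": {"_id": "$guid", "avg_value": {"$avg": "$samples.value"}}},
]

results = list(db.geometry.aggregate(pipeline, allowDiskUse=True))
```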
If you're going to be dealing with a geometry and its time series data together, it makes sense to store them in the same doc. A year's worth of data in 15 minute increments isn't killer, and you definitely don't want a document for every time-series entry! Since you can retrieve everything you want to operate on as a single geometry document, it's a big win. Note that this also lets you deal with sparse or missing data: you can encode the data differently if it's sparse, rather than indexing into a 35,040 slot array.
A $geoIntersects on a big pile of geometry data will be a performance issue, though. Make sure you have an appropriate index (like 2dsphere) to speed things up.
If there is any way you can build additional qualifiers into the query that could cheaply eliminate members from the more expensive search, you may make things zippier. Like, say the search will hit states in the US. You could first intersect the search with state boundaries to find the states containing the geodata and use something like a postal code to qualify the documents. That would be a really quick pre-search against 50 documents. If a search boundary was first determined to hit 2 states, and the geo-data records included a state field, you just winnowed away 96 million records (all things being equal) before the more expensive geo part of the query. If you intersect against smallish grid coordinates, you may be able to winnow it further before the geo data is considered.
Of course, going too far adds overhead. If you can correctly tune the system to the density of the 100 million geometries, you may be able to get the times down pretty low. But without actually working with the specifics of the problem, it's hard to know. That much data probably requires some specific experimentation rather than relying on a general solution.
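A rough sketch of the embedded layout, the 2dsphere index, and the cheap pre-qualifier idea (pymongo; all names, the state field, and the sample coordinates are assumptions):

```python
from pymongo import GEOSPHERE, MongoClient

coll = MongoClient().mydb.geo_series  # hypothetical database/collection names

# One document per geometry, with its year of 15-minute samples embedded.
# 365 * 96 = 35,040 slots; a sparse mapping keyed by slot offset also works.
coll.insert_one({
    "guid": "abc-123",                                    # hypothetical identifier
    "geometry": {"type": "Point", "coordinates": [-73.97, 40.78]},
    "samples": [0.0] * (365 * 96),
    "state": "NY",                                        # cheap pre-filter qualifier
})

# 2dsphere index so $geoIntersects doesn't have to scan everything,
# plus an index on the cheap qualifier.
coll.create_index([("geometry", GEOSPHERE)])
coll.create_index("state")

search_area = {"type": "Polygon",
               "coordinates": [[[-74.1, 40.6], [-73.8, 40.6], [-73.8, 40.9],
                                [-74.1, 40.9], [-74.1, 40.6]]]}

# Cheap qualifier first (state), then the more expensive geo predicate.
matches = coll.find({
    "state": {"$in": ["NY", "NJ"]},
    "geometry": {"$geoIntersects": {"$geometry": search_area}},
})
```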
I'm looking for a way to solve a data storage problem for a project.
The Data:
We have a batch process that generates 6000 vectors of size 3000 each daily. Each element in the vectors is a DOUBLE. For each of the vectors, we also generate tags like "Country", "Sector", "Asset Type" and so on (It's financial data).
The Queries:
What we want to be able to do is see aggregates by tag of each of these vectors. So, for example, if we want to see the vectors by sector, we want to get back a response that gives us all the unique sectors and, for each one, a 3000x1 vector that is the element-wise sum of all the vectors tagged with that sector.
What we've tried:
It's easy enough to implement a normalized star schema with 2 tables, one with the tagging information and an ID and a second table that has "VectorDate, ID, ElementNumber, Value" which will have a row to represent each element for each vector. Unfortunately, given the size of the data, it means we add 18 million records to this second table daily. And since our queries need to read (and add up) all 18 million of these records, it's not the most efficient of operations when it comes to disk reads.
Sample query:
SELECT T1.country, T2.ElementNumber, SUM(T2.Value)
FROM T1 INNER JOIN T2 ON T1.ID=T2.ID
WHERE T2.VectorDate = 20140101
GROUP BY T1.country, T2.ElementNumber
I've looked into NoSQL solutions (which I don't have experience with) and have seen that some, like MongoDB, allow storing entire vectors as part of a single document. However, I'm unsure whether they can do the aggregation we're after efficiently (adding each element of a document's vector to the corresponding element of other documents' vectors). I've also read that the $unwind operation this would require isn't that efficient either?
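For what it's worth, if each day's vector were stored as an array in a single document, the element-wise sum by tag could be written as an aggregation like the sketch below (pymongo; the document shape and field names are assumptions). It still relies on $unwind fanning every document out into 3,000 rows, which is the efficiency concern mentioned above:

```python
from pymongo import MongoClient

coll = MongoClient().mydb.vectors  # hypothetical database/collection names

# Assumed document shape, one document per vector per day:
# { "vector_date": "2014-01-01", "country": "US", "sector": "Energy",
#   "values": [ ...3000 doubles... ] }

pipeline = [
    {"$match": {"vector_date": "2014-01-01"}},
    # Fan each document out into (element_number, value) pairs.
    {"$unwind": {"path": "$values", "includeArrayIndex": "element_number"}},
    # Element-wise sum per sector.
    {"$group": {
        "_id": {"sector": "$sector", "element": "$element_number"},
        "total": {"$sum": "$values"},
    }},
    {"$sort": {"_id.sector": 1, "_id.element": 1}},
]

for row in coll.aggregate(pipeline, allowDiskUse=True):
    print(row["_id"]["sector"], row["_id"]["element"], row["total"])
```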
It would be great if someone could point me in the direction of a database solution that can help us solve our problem efficiently.
Thanks!