I have a very large collection (hundreds of millions of documents).
Each document has the following fields:
date in YYYY/MM/DD format
name
type1
type2
value
There are ~50 different names, ~60 different type1, ~20 different type2
I need to read from this database, usually it is either:
a unique set of (name,type1,type2), but with all dates
a few dates for all type1
List item
Currently I am reading without any indexes and it is very slow, much slower than having a few flat SQL tables.
How can I use indexes to speed up this database?
thanks
date in YYYY/MM/DD format
MongoDB has a native datetime type. Use it. It uses less memory than the string and it doesn't need additional conventions. Your format is sane in the sense that its lexicographical ordering is equivalent to chronological ordering for dates between 0001-01-01 and 9999-12-31, but the built-in datatype is definitely preferable for range queries.
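If the existing documents already store the date as a YYYY/MM/DD string, some conversion step is needed before switching to the native type. Here is a minimal sketch in plain JavaScript, assuming the strings are always zero-padded and should be read as UTC midnight. parseYmd is a hypothetical helper name, not part of MongoDB.

```javascript
// Hypothetical helper: turn a "YYYY/MM/DD" string into a Date at UTC midnight.
function parseYmd(s) {
  const m = /^(\d{4})\/(\d{2})\/(\d{2})$/.exec(s);
  if (!m) throw new Error("unexpected date format: " + s);
  // Date.UTC takes a zero-based month, hence the "- 1".
  return new Date(Date.UTC(Number(m[1]), Number(m[2]) - 1, Number(m[3])));
}

// For zero-padded strings, string order and chronological order agree.
const raw = ["2016/01/03", "2015/12/31", "2016/01/02"];
const sortedStrings = [...raw].sort();
const sortedDates = raw.map(parseYmd).sort((a, b) => a - b);
```

Both sorts put the three sample dates in the same order, which is the property the string format relies on.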
There are ~50 different names, ~60 different type1, ~20 different type2
Your keys have very low selectivity, so individual indexes on them are probably pointless.
I need to read from this database, usually it is either:
a unique set of (name,type1,type2), but with all dates
Use a compound index for {name, type1, type2}. If you also need chronological ordering, you might want to add date for sorting or use a monotonic primary key like ObjectId and rely on natural ordering.
db.collection.createIndex({'name' : 1, 'type1' : 1, 'type2' : 1, 'date' : 1});
a few dates for all type1
What is 'a few dates'? I assume you mean something like "all dates in a given date range"? Use an index on date. The date field should naturally have much better selectivity, so a single-field index makes sense.
db.collection.createIndex({'date' : 1});
List item
List all items? Any ordering? You need something more specific. Keep in mind that skip/take is expensive.
I have been trying to figure out a way to query a list of documents with a range filter on one field while ordering by another field, which of course isn't possible. See my other question: Order by timestamp with range filter on different field Swift Firestore
But is it possible to save documents with the timestamp as id and then it would sort by default? Or maybe hardcode an ID, then retrieve the last created document id and increase id by one for the next post to be uploaded?
This shows how the documents are ordered in the collection.
Any ideas how to store documents so they are ordered by created at in the collection?
It will order by document ID (ascending) by default in Swift.
You can use .order(by: '__id__'), but the better, documented way is FieldPath.documentID(). I don't really know Swift, but I assume it's something like:
.order(by: FirebaseFirestore.FieldPath.documentID())
JavaScript too has an internal variable which simply returns __id__.
.orderBy(firebase.firestore.FieldPath.documentId())
Interestingly enough __name__ also works, but that sorts the whole path, including the collection name (and also the id of course).
If I correctly understood your need, by doing the following you should get the correct order:
For each document, add a specific field of type number, called for example sortNbr and assign as value a timestamp you calculate (e.g. the epoch time, see Get Unix Epoch Time in Swift)
Then build a query sorted on this field value, like:
let docRef = db.collection("xxxx")
docRef.order(by: "sortNbr")
See the doc here: https://firebase.google.com/docs/firestore/query-data/order-limit-data
Yes, you can do this.
By default, a query retrieves all documents that satisfy the query in
ascending order by document ID.
See the docs here: https://firebase.google.com/docs/firestore/query-data/order-limit-data
So if you find a way to use a timestamp or other primary key value where the ascending lexicographical ordering is what you want, you can filter by any fields and still have the results sorted by the primary key, ascending.
Be careful to zero-pad your numbers to the maximum width if you use a numeric key like seconds since epoch or an integer sequence. As strings, 10 sorts before 2, but 10 sorts after 02.
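The padding issue is easy to see with a plain string sort. This is a small sketch in JavaScript; paddedKey is a hypothetical helper and the width of 12 is an arbitrary assumption that you would size to your largest expected value.

```javascript
// Hypothetical helper: fixed-width key so string order matches numeric order.
function paddedKey(n, width = 12) {
  return String(n).padStart(width, "0");
}

// Unpadded numeric strings sort lexicographically, not numerically.
const unpadded = ["10", "2", "1"].sort();

// Padded to a common width, the same values sort in numeric order.
const padded = [10, 2, 1].map((n) => paddedKey(n, 2)).sort();
```

Here unpadded ends up as "1", "10", "2", while padded ends up as "01", "02", "10".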
Using ISO formatted YYYY-mm-ddTHH:MM:SS date-time strings would work, because they sort naturally in ascending order.
The order of the documents shown in the Firebase console is mostly irrelevant to the functioning of your code that uses Firestore. The console is just for browsing data, and that sorting scheme makes it relatively intuitive to find a document you might be looking for, if you know its ID. You can't change this sort order in the console.
Your code is obviously going to have other requirements, and those requirements should be coded into your queries, without regarding any sort order you see in the dashboard. If you want time-based ordering of your documents, you'll have to store some sort of timestamp field in the document, and use that for ordering. I don't recommend using the timestamp as the ID of a document, as that could cause problems for you in the future.
Why do we need the created_at field when the timestamp can be found in the first 4 bytes of the ObjectId?
ObjectId("5349b4ddd2781d08c09890f4").getTimestamp()
Taken from MongoDB Docs
There are several cases where it makes sense to do so:
When you need better precision - ObjectId.getTimestamp() is precise up to seconds, while Date fields store milliseconds. Compare this in mongo shell: new Date() yields ISODate("2016-01-03T21:21:38.032Z"), while ObjectId().getTimestamp() yields ISODate("2016-01-03T21:21:50Z").
When you are not using ObjectId at all - it is often taken as a given that _id field should be populated with ObjectId, while in fact ObjectId is only a default used by most of the drivers and MongoDB itself doesn't impose it - on the contrary, it is encouraged to use any "natural" unique ID if it exists for the documents. In this case though you will have to store "creation timestamp" yourself if you need it.
Usability - if you rely on the presence of this field and the data in it, it might be better, at least from a design standpoint, to be explicit about it. This is more a matter of taste though. However, as noted in the comments, if you also want to filter or sort by "creation timestamp", it will be easier with a dedicated field, since you can apply query operators like $gt directly to it.
As you said, like it states clearly in the documentation:
Since the _id ObjectId by default stores the 4 byte timestamp, in most cases you do not need to store the creation time of any document.
And you may use ObjectId("5349b4ddd2781d08c09890f4").getTimestamp() in order to get the creation date in ISO date format.
It is also a matter of convenience to the customer (us) to have a service like that, as it makes getting the creation date and performing actions on it much more intuitive and easy.
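To make the "first 4 bytes" point concrete, the timestamp can be decoded by hand from the hex string of the ObjectId. This is only an illustrative sketch in plain JavaScript. objectIdTimestamp is a hypothetical helper, and in the mongo shell you would simply call getTimestamp() as shown above.

```javascript
// Hypothetical helper: decode the creation time from a 24-character ObjectId hex string.
function objectIdTimestamp(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("not an ObjectId hex string: " + hex);
  }
  // The first 4 bytes (8 hex digits) hold seconds since the Unix epoch.
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdTimestamp("5349b4ddd2781d08c09890f4");
```

Because only whole seconds are stored, the milliseconds part of the decoded Date is always zero, which is the precision limit mentioned in the first answer.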
I would like to store and query documents that contain a from-to date range, where the range represents an interval when the document has been valid.
Typical use cases in lucene/solr documentation address the opposite problem: Querying for documents that contain a single timestamp and this timestamp is contained in a date range provided as query parameter. (createdate:[1976-03-06T23:59:59.999Z TO *])
I want to use the edismax parser.
I have found the ms() function, which seems to me to be designed for boosting score only, not to eliminate non-matching results entirely.
I have found the article Spatial Search Tricks for People Who Don't Have Spatial Data, where the problem described by me is said to be Easy... (Find People Alive On May 25, 1977).
Is there any simpler way to express something like
date_from_query:[valid_from_field TO valid_to_field] than using the spatial approach?
The most direct approach is to create the bounds yourself:
valid_from_field:[* TO date_from_query] AND valid_to_field:[date_from_query TO *]
This gives you documents where valid_from_field is at or before the date you're querying and valid_to_field is at or after it. In effect, it selects the documents whose interval from valid_from_field to valid_to_field contains the query date. This assumes that neither field is multivalued.
I'd probably add it as a filter query, since you don't need any scoring from it, and you probably want to allow other search queries at the same time.
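As a rough sketch of how a client might assemble that filter query, the following JavaScript builds the fq parameter for a given point in time. validAtFilter is a hypothetical helper. The field names are the ones used above, and the sample date is the one from the article mentioned in the question.

```javascript
// Hypothetical helper: build a Solr filter query matching documents valid at `when`.
function validAtFilter(when) {
  const ts = when.toISOString(); // ISO 8601 in UTC, e.g. 1977-05-25T00:00:00.000Z
  return `valid_from_field:[* TO ${ts}] AND valid_to_field:[${ts} TO *]`;
}

// The filter goes into fq, so the main query can still use edismax for scoring.
const params = new URLSearchParams({
  defType: "edismax",
  fq: validAtFilter(new Date("1977-05-25T00:00:00Z")),
});
```

Keeping the range check in fq leaves q free for whatever the user actually searches for.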
Almost all my documents include two fields: a start timestamp and a final timestamp. In each query, I need to retrieve the elements that fall within a selected period of time, so start should be after the selected start value and final should be before the selected end value.
The query looks like:
db.collection.find({start:{$gt:DateTime(...)}, final:{$lt:DateTime(...)}})
So what is the best indexing strategy for that scenario?
By the way, which is better for performance: storing dates as datetimes, or as Unix timestamps, which are plain long values?
To add a little more to baloo's answer.
On the timestamp vs. long issue: generally the MongoDB server will not see a difference. The BSON encoding length is the same (64 bits). You may see a performance difference on the client side depending on the driver's encoding. For example, on the Java side, the 10gen driver renders a timestamp as a Date, which is a lot heavier than a Long. There are drivers that try to avoid that overhead.
The other issue is that you will see a performance improvement if you close the range for the first field of the index. So if you use the index suggested by baloo:
db.collection.createIndex({start: 1, final: 1})
The query will perform (potentially much) better if it is:
db.collection.find({start:{$gt:DateTime(...),$lt:DateTime(...)},
final:{$lt:DateTime(...)}})
Conceptually, if you think of the index as a tree, the closed range limits both sides of the tree instead of just one side. Without the closed range the server has to "check" all of the entries with a start greater than the timestamp provided, since it does not know of the relation between start and final.
You may even find that the compound index performs no better than a single-field index like:
db.collection.createIndex({start: 1})
Most of the savings come from pruning on the first field. The exceptions are when the query is covered by the index or when the sort order of the results can be derived from the index.
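The pruning effect can be illustrated with a toy model. This is not how the server's B-tree actually works. It is only a sorted array standing in for the {start, final} index, with made-up integer timestamps, and it counts how many entries a scan has to examine with and without an upper bound on start.

```javascript
// Toy stand-in for an index on {start, final}: entries sorted by start.
// The data is hypothetical: 100 intervals, each 5 time units long.
const index = [];
for (let start = 0; start < 100; start++) {
  index.push({ start, final: start + 5 });
}

// Examine only entries with minStart < start < maxStart,
// then apply the condition on final to each examined entry.
function scan(minStart, maxStart, maxFinal) {
  let examined = 0;
  let matches = 0;
  for (const e of index) {
    if (e.start <= minStart) continue; // below the lower bound: not examined
    if (e.start >= maxStart) break;    // upper bound ends the scan early
    examined++;
    if (e.final < maxFinal) matches++;
  }
  return { examined, matches };
}

const openRange = scan(10, Infinity, 20); // start > 10, final < 20
const closedRange = scan(10, 20, 20);     // 10 < start < 20, final < 20
```

Both scans find the same 4 matches, but the open range examines 89 entries while the closed range examines only 9. Adding the upper bound on start is valid here only because every document has start before final.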
You can use a compound index to index multiple fields:
db.collection.createIndex({start: 1, final: 1})
Compare different queries and indexes with explain() to get the most out of your database.
I'm trying to store "Votes" in MongoDB and I am stuck on how to proceed in an efficient way.
Basically, I have a question with several options like A, B, C, D, ... (6 in total).
I am giving voters the option to choose an option and want to save the "Vote" with fields like:
MongoDate, option, voter name, and maybe a couple more fields.
I expect an unbounded number of "Votes" on a given question, from thousands up to millions.
In terms of retrieving the data: I would like to query it mainly by date and present it in charts, like a stock price with hourly, daily, monthly, ... intervals.
In other words, it is like a time series.
I am not sure about the "format" of the document in MongoDB.
One reasonable way to do it would be to have a votes collection, where each document looks like:
{
v: 'a', //voted for the first option
d: new Date(), //the date (new Date() gives a Date object; a bare Date() returns a string)
n: 'Bob',
...
}
Then, index on the date field. If you end up having to shard this collection, be careful not to shard on the date field alone. I listed the field names as single characters because the name of every field is stored in each MongoDB document, so for better space efficiency, you should use shorter names. If you aren't concerned about space, a longer, more informative name is probably fine.
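To show how documents of this shape map onto hourly chart data, here is a small in-memory sketch in JavaScript. The sample votes and the countByHour helper are hypothetical. With millions of votes you would more likely do this grouping on the server, but the bucketing idea is the same.

```javascript
// Hypothetical sample votes using the short field names from the document above.
const votes = [
  { v: "a", d: new Date("2016-01-03T21:05:00Z"), n: "Bob" },
  { v: "b", d: new Date("2016-01-03T21:40:00Z"), n: "Ann" },
  { v: "a", d: new Date("2016-01-03T22:10:00Z"), n: "Cy" },
];

// Count votes per option within each UTC hour.
function countByHour(docs) {
  const buckets = {};
  for (const { v, d } of docs) {
    // Truncate the ISO timestamp to the hour, e.g. 2016-01-03T21:00Z.
    const hour = d.toISOString().slice(0, 13) + ":00Z";
    buckets[hour] = buckets[hour] || {};
    buckets[hour][v] = (buckets[hour][v] || 0) + 1;
  }
  return buckets;
}

const hourly = countByHour(votes);
```

Daily or monthly buckets follow the same pattern with a shorter slice of the ISO string (10 characters for a day, 7 for a month).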