Sampling records from a MongoDB database while allowing duplicates

I am trying to get some random records from my small MongoDB database, but I want duplicates to be possible.
The sample function ($sample) doesn't really work for me, because it only produces duplicates under specific conditions that my query and database won't meet.
Is there any reasonable way of getting some random records while allowing duplicates, without downloading the entire collection?
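One pattern that should work (not from the original thread; a minimal pyMongo sketch in which the database name "mydb", the collection name "events" and the sample size N are placeholders) is to draw single-document samples repeatedly. Each $sample of size 1 is an independent draw, so the same document can come back more than once, and only N documents ever leave the server:

    import pymongo

    # Sketch: sample N documents *with replacement* by running N independent
    # single-document $sample stages (requires MongoDB 3.2+).
    # "mydb", "events" and N = 5 are placeholders.
    client = pymongo.MongoClient()
    events = client["mydb"]["events"]

    N = 5
    picks = []
    for _ in range(N):
        doc = next(events.aggregate([{"$sample": {"size": 1}}]), None)
        if doc is not None:
            picks.append(doc)

    print(picks)

The cost is N round trips, so this is only reasonable for small sample sizes.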

Related

Mongoose skip and limit (MongoDB Pagination)

I am using skip and limit in MongoDB with Mongoose, and it returns documents in what looks like a random order; sometimes it returns the same documents.
I am trying to get a limited number of documents, and after skipping I want to get the next batch of documents.
I guess you are trying to use the pagination concept here. In both SQL and NoSQL, the data must be sorted in a specific order to achieve pagination without jumbled data on each db call.
For example:
await Event.find({}).sort({ createdDate: 1 }).skip(10).limit(10);
In the case above I'm fetching documents 11-20 (skip the first 10, take the next 10), sorted by createdDate. That way the data won't be shuffled between fetches, unless you insert or delete documents.
I hope this answers your question.

Updating data in Mongo sorted by a particular field

I posted this question on the Software Engineering portal without conducting any tests. It was also brought to my notice that this needs to be posted on SO, not there. Thanks for the help in advance!
I need Mongo to return the documents sorted by a field value. The easiest way to achieve this would be running the command db.collectionName.find().sort({field:priority}). However, when I tried this method on a dummy collection of 1000 documents, it ran in 22ms. I also tried running db.collectionName.find() on the same data, and it ran in 3ms, which means that Mongo is taking time to sort and return the documents (which is understandable). Both tests were done in the same environment, by adding .explain("executionStats") to the query.
I will be working with a large amount of data and concurrent requests to access DB, so I need the querying to be faster. My question is, is there a way to always keep the data sorted by a field in the DB so that I don't have to sort it over and over for all requests? For instance, some sort of update command that could sort the entire DB once a week or so?
A non-unique index on that field in this collection will give the results you're after and avoid inefficient in-memory sorting.
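For illustration, a minimal pyMongo sketch; the sort field is assumed to be called priority, and the database name is a placeholder:

    import pymongo

    client = pymongo.MongoClient()
    coll = client["mydb"]["collectionName"]

    # A non-unique ascending index on the sort field ("priority" is assumed).
    coll.create_index([("priority", pymongo.ASCENDING)])

    # A query that sorts on the indexed field can now walk the index instead of
    # performing an in-memory sort.
    cursor = coll.find().sort("priority", pymongo.ASCENDING)

The same index can be created from the mongo shell with db.collectionName.createIndex({ priority: 1 }).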

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that is iteratively going through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adding new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would upsert the entire revised documents (using pyMongo), but now I'm questioning that:
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does that mean your data is appending additional fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this you will want an index that sorts your results by descending _id, so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data, as opposed to seeking and updating data, so it is faster.
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since upserting requires you to find the original document first.
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
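For reference, a hedged pyMongo sketch of batching the writes rather than issuing one round trip per document, written here as $set updates that add only the new fields (one of the two options the question raises). The collection name, the batch size and the compute_new_fields helper are all placeholders, not part of the original post:

    from pymongo import MongoClient, UpdateOne

    client = MongoClient()
    coll = client["mydb"]["documents"]

    BATCH = 1000  # placeholder batch size
    ops = []
    for doc in coll.find():
        # compute_new_fields is a hypothetical helper that builds the fields to add.
        new_fields = compute_new_fields(doc)
        ops.append(UpdateOne({"_id": doc["_id"]}, {"$set": new_fields}))
        if len(ops) >= BATCH:
            coll.bulk_write(ops, ordered=False)
            ops = []
    if ops:
        coll.bulk_write(ops, ordered=False)

Using $set touches only the added fields, so the rest of each document is not resent over the wire, and the unordered bulk write lets the server process each batch in a single round trip.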

How to paginate query results in NoSQL databases when there are no unique fields included in the projection?

I've heard using MongoDB's skip() to batch query results is a bad idea because it can lead to the server becoming IO bound as it has to 'walk through' all the results. I want to only return a maximum of 200 documents at a time, and then the user will be able to fetch the next 200 if they want (assuming they haven't limited it to less).
Initially I read up on paginating results and most things I read said the easiest way in MongoDB at least is to modify the query criteria to simulate skipping through.
For example, if a field called accNumber on the last document is 28022004, then the next query should have "accNumber > 28022004" in the criteria. But what if there are no unique fields included in the projection? What if the user wants to sort the records by a non-unique field?
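One common workaround (a pyMongo sketch, not taken from the question itself) is to add _id as a tie-breaker: sort on the non-unique field plus _id, and continue from the last document's pair of values. The accNumber field follows the example above; the collection name and fetch_page helper are placeholders:

    from pymongo import MongoClient, ASCENDING

    client = MongoClient()
    coll = client["mydb"]["accounts"]  # placeholder collection

    PAGE = 200

    def fetch_page(last_acc=None, last_id=None):
        """Fetch the next page using (accNumber, _id) as a keyset cursor."""
        query = {}
        if last_acc is not None:
            # Documents strictly after the last one seen, even when accNumber repeats.
            query = {"$or": [
                {"accNumber": {"$gt": last_acc}},
                {"accNumber": last_acc, "_id": {"$gt": last_id}},
            ]}
        return list(coll.find(query)
                        .sort([("accNumber", ASCENDING), ("_id", ASCENDING)])
                        .limit(PAGE))

Since _id is always unique, the compound key (accNumber, _id) is unique even when accNumber is not, so each page resumes exactly where the previous one stopped without using skip().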

Mongodb - add if not exists (non id field)

I have a large number of records to iterate over (coming from an external data source) and then insert into a Mongo DB.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.
I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If it does not exist it will insert it, otherwise it will update it.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
Two, you could just create a unique index on that key and then insert ignoring the error generated. Exactly how to do this will be somewhat dependent on language and driver used along with version of mongodb. This has the potential to be faster when performing batch inserts, but YMMV.
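A hedged pyMongo sketch of option one, assuming the unique key is a field called externalId and that incoming_records is your iterable of source records (both are placeholders):

    from pymongo import MongoClient

    client = MongoClient()
    coll = client["mydb"]["records"]

    # Option one: upsert keyed on the unique field.
    # "externalId" and incoming_records are placeholders for your own key and source.
    for record in incoming_records:
        coll.update_one({"externalId": record["externalId"]},
                        {"$set": record},
                        upsert=True)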
2 million is not a huge number and should not affect performance; splitting your records' fields into different collections would be good enough.
I suggest creating a unique index on your unique key before inserting into MongoDB.
The unique index will filter out the redundant data (you lose the duplicate records), and you can ignore the errors.
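For completeness, a hedged pyMongo sketch of the unique-index approach both answers mention: create the index once, insert in batches with ordered=False so the remaining documents still go in after a duplicate is hit, and swallow the duplicate-key errors. The field name externalId and batch_of_records are placeholders:

    from pymongo import MongoClient, ASCENDING
    from pymongo.errors import BulkWriteError

    client = MongoClient()
    coll = client["mydb"]["records"]

    # Unique index on the natural key ("externalId" is a placeholder field name).
    coll.create_index([("externalId", ASCENDING)], unique=True)

    try:
        # ordered=False keeps inserting the rest of the batch after a duplicate.
        coll.insert_many(batch_of_records, ordered=False)
    except BulkWriteError:
        pass  # duplicates were rejected by the index; everything else was inserted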