What is the best way to delete large numbers of documents from a Cloudant database? - ibm-cloud

If you are deleting a large number of documents from a Cloudant database, say 2 million, is it best to batch the delete calls, or is there another approach?

Yes, batching the deletes with a _bulk_docs request is the best way to approach this. Making two million single document delete requests would be impractical.
Deletes are just updates where the new doc revision has the deleted flag set.
The batch size is a bit of a compromise between not too small, to minimise the number of requests; and not so large that it puts too much work on the server. A thousand documents is a reasonable starting point. In this example it would still require 2,000 calls to delete all the documents.
For an example of how to create a batch and issue a bulk delete command see this blog post.

Related

Firestore full collection update for schema change

I am attempting to figure out a solid strategy for handling schema changes in Firestore. My thinking is that schema changes would often require reading and then writing to every document in a collection (or possibly documents in a different collection).
Here are my concerns:
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
Should I be using batched writes?
Also, feel free to tell me if you think this is the complete wrong approach to implementing schema changes, and suggest a better solution.
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
If the number of documents gets too large to handle in a single query, you can start paginating the results.
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
That's impossible to say at this moment.
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
If you need the existing contents of a document to determine its new contents, then you'll indeed need to read it. If you don't need the existing contents, all you need is the path, and you can consider using the Node.js API to only retrieve the document IDs.
Should I be using batched writes?
Batched writes have no performance advantages. In fact, they're often slower than sending the individual update calls in parallel from your code.

Distributing big data storage for non-relational data

The problem consists of a lot (apprx. 500 million per day) of non-relational messages of relatively small size (apprx. 1KB). The messages are written once and never modified again. The messages has various structures, though there are patterns that the message must fit in. This data then must be used to make a search over them. The search may be done on any fields of the message, the only always present field is the date, thus the search will be done for a specific day.
The approach I have come up so far is to use MongoDB. Each day I create a few collections (apprx. 2000) and distribute messages during the day to those collections according to the pattern. I find the patterns important because I make indexing that the number of indexes is limited to 64.
This strategy results in 500G of data + 150G of indexes = 650G per day. Of course, the question here is how to distribute those data? Obvious solution is to use Mongo Sharding and spread the collections over the shards. However, I have not find any scenario close to my problem described in mongo manuals. Moreover, I am not even sure if I can dynamically (not manually) add new collections every day to shards. Any knowledge/suggestions from expreinced users? Shoudl I change my design?

mongodb read, copy, process and delete

I have to write an app that constantly polls a mongodb db collection in a given db. If it finds documents it reads them copies them to another db, does some extra processing and deletes them from the original db.
What is the most efficient way to implement this? What are the best practices?
Is it better to process one doc at a time: read one document, copy the document then delete it
or is it better to read all documents, copy all of them, then delete all of them?
What would be the best way to handle failures in the middle of one of these read, write deletes?
Bulk reads, inserts and deletes are almost always more performant than single document actions. But try to limit it to a maximum number of documents, e.g. in our setup 500 seemed to be optimal.
For handling errors, you could use the following pseudo transaction pattern:
findAndModify while setting "state":"pending" for all read documents
process documents
bulk insert
delete all documents with "state":"pending"
If something goes wrong in the processing part or the bulk insert, you can unlock all locked documents and try again.
A more elaborate example of these kind of psuedo transactions can be found in the MongoDB Tutorial:
http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/

Is it a good idea to generate per day collections in mongodb

Is it a good idea to create per day collections for data on a given day (we could start with per day and then move to per hour if there is too much data). Is there a limit on the number of collections we can create in mongodb, or does it result in performance loss (is it an overhead for mongodb to maintain so many collections). Does a large number of collections have any adverse effect on performance?
To give you more context, the data will be more like facebook feeds, and only the latest data (say last one week or month) is more important to us. Making per day collections keeps the number of documents low, and probably would result in fast access. Even if we need old data, we can fall back to older collections. Does this make sense, or am I heading in the wrong direction?
what you actually need is to archive the old data. I would suggest you to take a look at this thread at the mongodb mailing list:
https://groups.google.com/forum/#!topic/mongodb-user/rsjQyF9Y2J4
Last post there from Michael Dirolf (10gen)says:
"The OS will handle LRUing out data, so if all of your queries are
touching the same portion of data that should stay in memory
independently of the total size of the collection."
so I guess you can stay with single collection and good indexes will do the work.
anyhow, if the collection goes too big you can always run manual archive process.
Yes, there is a limit to the number of collections you can make. From the Mongo documentation Abhishek referenced:
The limitation on the number of namespaces is the size of the namespace file divided by 628.
A 16 megabyte namespace file can support approximately 24,000 namespaces. Each index also counts as a namespace.
Indexes etc. are included in the namespaces, but even still, it would take something like 60 years to hit that limit.
However! Have you considered what happens when you want data that spans collections? In other words, if you wanted to know how many users have feeds updated in a week, you're in a bit of a tight spot. It's not easy/trivial to query across collections.
I would recommend instead making one collection to store the data and simply move data out periodically as Tamir recommended. You can easily write a job to move data out of the collection every week or every month.
Creating a collection is not much overhead, but it the overhead is larger than creating a new document inside a collections.
There is a limitation on the no of collections that you can create: " http://docs.mongodb.org/manual/reference/limits/#Number of Namespaces "
Making new collections to me, won't be having any performance difference because in RAM you cache only those data that you actually query. In your case it will be recent feeds etc.
But having per day/hour collection will help you in achieving old data very easily.

MongoDB -- large number of documents

This is related to my last question.
We have an app where we are storing large amounts of data per user. Because of the nature of data, previously we decided to create a new database for each user. This would have required a large no. of databases (probably millions) -- and as someone pointed out in a comment, that this indicated wrong design.
So we changed the design and now we are thinking about storing each user's entire information in one collection. This means one collection exactly maps to one user. Since there are 12,000 collections available per database, we can store 12,000 users per DB (and this limit could be increased).
But, now my question is -- is there any limit on the no. of documents a collection can have. Because of the way we need to store data per user, we expect to have a huge (tens of millions in extreme cases) no. of document per documents. Is that OK for MongoDB and design-wise?
EDIT
Thanks for the answers. I guess then it's OK to use large no of documents per collection.
The app is a specialized inventory control system. Each user has a large no. of little pieces of information related to them. Each piece of information has a category and some related stuff under that category. Moreover, no two collections need to see each other's data -- hence an index that touch more than one collection is not needed.
To adjust the number of collections/indexes you can have (~24k is the limit--~12k is what they say for collections because you have the _id index by default, but keep in mind, if you have more indexes on the collections, that will use namespace up as well), you can use the --nssize option when you start up mongod.
There are plenty of implementations around with billions of documents in a collection (and I'm sure there are several with trillions), so "tens of millions" should be fine. There are some numbers such as counts returned that have constraints of 64 bits, so after you hit 2^64 documents you might find some issues.
What sort of query and update load are you going to be looking at?
Your design still doesn't make much sense. Why store each user in a separate collection?
What indexes do you have on the data? If you are indexing by some field that has content that's common across all the users you'll get a significant saving in total index size by having a single collection with one index.
Index size is often the limiting factor not total database size when it comes to performance.
Why do you have so many documents per user? How large are they?
Craigslist put 2+ billion documents in MongoDB so that shouldn't be an issue if you have the hardware to support it and aren't being inefficient with your indexes.
If you posted more of your schema here you'd probably get better advice.