What does "missing" mean when I look at the revision history of a Cloudant document?

When I try to see the revision history of a document in Cloudant using the /<DB>/<doc_id>?revs_info=true call, I get the following:
"_revs_info": [
{
"rev": "4-xxx",
"status": "available"
},
{
"rev": "3-xxx",
"status": "missing"
},
{
"rev": "2-xxx",
"status": "missing"
},
{
"rev": "1-xxx",
"status": "missing"
}
What does status:missing mean?

In Cloudant, old revisions are regularly purged using a process called compaction, which is designed to manage the size of your database.
Once a revision of a document is superseded, it is eventually removed by a background compaction task. After that happens, the revision's content is no longer available and you see the missing status.
Because compaction runs asynchronously in the background, you should not rely on revisions as a way of accessing old versions of documents; sooner or later they will let you down!
There is more information about revisions in this blog post.
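As an illustration, a minimal Python sketch of the call above might look like this; the URL, IAM token, database and document names are placeholders you would replace with your own:

import os
import requests

# Placeholders: point these at your own Cloudant instance, database and document.
CLOUDANT_URL = os.environ["CLOUDANT_URL"]   # e.g. https://<account>.cloudantnosqldb.appdomain.cloud
IAM_TOKEN = os.environ["IAM_TOKEN"]
DB, DOC_ID = "mydb", "mydocid"

# GET /<DB>/<doc_id>?revs_info=true returns the document plus the status of each revision.
resp = requests.get(
    f"{CLOUDANT_URL}/{DB}/{DOC_ID}",
    params={"revs_info": "true"},
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
resp.raise_for_status()

# Only "available" revisions can still be fetched with ?rev=...;
# "missing" revisions have already been removed by compaction.
for info in resp.json()["_revs_info"]:
    print(info["rev"], info["status"])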

Related

How do I track down slow queries in Cloudant?

I have some queries running against my Cloudant service. Some of them return quickly but a small minority are slower than expected. How can I see which queries are running slowly?
IBM Cloud activity logs can be sent to LogDNA Activity Tracker. Each log item includes latency measurements, allowing you to identify which queries are running slower than others. For example, a typical log entry looks like this:
{
  "ts": "2021-11-30T22:39:58.620Z",
  "accountName": "xxxxx-yyyy-zzz-bluemix",
  "httpMethod": "POST",
  "httpRequest": "/yourdb/_find",
  "responseSizeBytes": 823,
  "clientIp": "169.76.71.72",
  "clientPort": 31393,
  "statusCode": 200,
  "terminationState": "----",
  "dbName": "yourdb",
  "dbRequest": "_find",
  "userAgent": "nodejs-cloudant/4.5.1 (Node.js v14.17.5)",
  "sslVersion": "TLSv1.2",
  "cipherSuite": "ECDHE-RSA-CHACHA20-POLY1305",
  "requestClass": "query",
  "parsedQueryString": null,
  "rawQueryString": null,
  "timings": {
    "connect": 0,
    "request": 1,
    "response": 2610,
    "transfer": 0
  },
  "meta": {},
  "logSourceCRN": "crn:v1:bluemix:public:cloudantnosqldb:us-south:a/abc12345:afdfsdff-dfdf34-6789-87yh-abcr45566::",
  "saveServiceCopy": false
}
The timings object contains various measurements, including the response time for the query.
For compliance reasons, the actual queries are not written to the logs, so to match queries to log entries you could put a unique identifier in the query string of the request; it will then appear in the rawQueryString parameter of the log entry.
For more information on logging, see this blog post.
Another option is simply to measure the HTTP round-trip latency in your client code.
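As an illustration of both ideas, here is a rough Python sketch that tags a request with a unique identifier and times it; the URL, token, database name, selector and the query_id parameter name are all placeholders/assumptions:

import os
import time
import uuid
import requests

# Placeholders: your own Cloudant URL, IAM token and database.
CLOUDANT_URL = os.environ["CLOUDANT_URL"]
IAM_TOKEN = os.environ["IAM_TOKEN"]

query_id = uuid.uuid4().hex              # will show up in rawQueryString in the log entry
started = time.perf_counter()
resp = requests.post(
    f"{CLOUDANT_URL}/yourdb/_find",
    params={"query_id": query_id},       # arbitrary extra query-string parameter
    json={"selector": {"type": "order"}, "limit": 24},   # example query only
    headers={"Authorization": f"Bearer {IAM_TOKEN}"},
)
elapsed_ms = (time.perf_counter() - started) * 1000
print(query_id, resp.status_code, f"{elapsed_ms:.0f} ms")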
Once you have found your slow queries, have a look at this post for ideas on how to optimise queries.

Is keeping a log of all document relationships an anti-pattern in CouchDB?

When we return each document in our database to be consumed by the client, we must also add a property "isInUse" to that document's response payload to indicate whether the document is referenced by other documents.
This is needed because referenced documents cannot be deleted, so a trash bin button should not be displayed next to their listing entries in the client-side app.
So basically we have relationships where one document can reference another, like this:
{
  "_id": "factor:1I9JTM97D",
  "someProp": 1,
  "otherProp": 2,
  "defaultBank": <id of some bank document>
}
Previously we used views and selectors to query for each document's references in other documents; however, this proved to be non-trivial.
So here's how someone on our team has implemented it now: we register all relationships in dedicated "relationship" documents like the one below and update them every time a document is created/updated/deleted by the server, to reflect any new references or de-references:
{
  "_id": "docInUse:bank",
  "_rev": "7-f30ffb403549a00f63c6425376c99427",
  "items": [
    {
      "id": "bank:1S36U3FDD",
      "usedBy": [
        "factor:1I9JTM97D"
      ]
    },
    {
      "id": "bank:M6FXX6UA5",
      "usedBy": [
        "salesCharge:VDHV2M9I1",
        "salesCharge:7GA3BH32K"
      ]
    }
  ]
}
The question is whether this solution is an anti-pattern and what the potential drawbacks are.
I would say using a single document to record the relationships between all other documents could be problematic because:
the document "docInUse:bank" could end up being updated very frequently. Cloudant allows you to update documents, but once you get to many thousands of revisions the document size becomes non-trivial, because all of the previous revision tokens are retained
updating a central document invites document conflicts if two processes attempt to update it at the same time. You are allowed to have conflicts, but it is your app's responsibility to manage them (see here)
if you have lots of relationships, this document could get very large (I don't know enough about your app to judge)
Another solution is to keep your bank:*, factor:* & salesCharge:* documents the same and create a document per relationship e.g.
{
  "_id": "1251251921251251",
  "type": "relationship",
  "doc": "bank:1S36U3FDD",
  "usedby": "factor:1I9JTM97D"
}
You can then find the documents on either side of the "join" by querying on the value of doc or usedby with a suitable index.
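As a rough sketch of that approach in Python (the URL, token and database name are placeholders), you could create an index on each field and then query with _find:

import os
import requests

CLOUDANT_URL = os.environ["CLOUDANT_URL"]   # placeholder: your own Cloudant instance
AUTH = {"Authorization": f"Bearer {os.environ['IAM_TOKEN']}"}
DB = "mydb"

# Index both sides of the "join" so _find queries on them are cheap.
for field in ("doc", "usedby"):
    requests.post(f"{CLOUDANT_URL}/{DB}/_index", headers=AUTH,
                  json={"index": {"fields": [field]}, "name": f"by-{field}", "type": "json"})

# "Which documents use this bank?"
r = requests.post(f"{CLOUDANT_URL}/{DB}/_find", headers=AUTH,
                  json={"selector": {"doc": "bank:1S36U3FDD"}})
print([row["usedby"] for row in r.json()["docs"]])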
I've also seen implementations where the document's _id field contains all of the information:
{
  "_id": "bank:1S36U3FDD:factor:1I9JTM97D",
  "added": "2018-02-28 10:24:22"
}
and the primary index helpfully sorts the document ids for you, allowing you to make judicious use of GET /db/_all_docs?startkey=x&endkey=y to fetch the relationships for a given bank id.
If you need to undo a relationship, just delete the document!
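A hypothetical Python helper along those lines, relying on the sorted order of _all_docs (the base URL, database name and auth header are passed in, and the id layout follows the example above):

import requests

def relationships_for(bank_id, base_url, db, auth):
    # Ids look like "bank:1S36U3FDD:factor:1I9JTM97D", so all relationships for one
    # bank sort together; \ufff0 is a conventional "high" end-of-range sentinel.
    resp = requests.get(
        f"{base_url}/{db}/_all_docs",
        params={"startkey": f'"{bank_id}:"', "endkey": f'"{bank_id}:\ufff0"'},
        headers=auth,
    )
    resp.raise_for_status()
    # The part of the id after the bank id is the document that uses it.
    return [row["id"].split(":", 2)[2] for row in resp.json()["rows"]]

# e.g. relationships_for("bank:1S36U3FDD", base_url, "mydb", auth) -> ["factor:1I9JTM97D", ...]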
By building a cache of relationships on every document create/update/delete, as you have currently implemented it, you are basically recreating an index manually in the database. This is why I would lean towards calling it an anti-pattern.
One great way to improve your design is to store each relation as a separate document, as Glynn suggested.
If your concern is consistency (which I think might be the case, judging by the document types you mentioned), try to put all information about a transaction into a single document. You can define the relationships in a consistent place in your documents, so updating the views would not be necessary:
{
  "_id": "salesCharge:VDHV2M9I1",
  "relations": [
    { "type": "bank", "id": "bank:M6FXX6UA5" },
    { "type": "whatever", "id": "whatever:xy" }
  ]
}
Then you can keep your views consistent, and you can rely on CouchDB to keep the "relation cache" up to date.
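As a sketch of what such a view might look like (the design document and view names are made up), assuming the relations array shown above:

import os
import requests

CLOUDANT_URL = os.environ["CLOUDANT_URL"]   # placeholder
AUTH = {"Authorization": f"Bearer {os.environ['IAM_TOKEN']}"}
DB = "mydb"

# Emit one row per relation: key = the referenced id, value = the referring document's id.
MAP_FN = """
function (doc) {
  if (doc.relations) {
    doc.relations.forEach(function (r) { emit(r.id, doc._id); });
  }
}
"""

# Create the design document; CouchDB keeps the resulting index up to date automatically.
requests.put(f"{CLOUDANT_URL}/{DB}/_design/relations", headers=AUTH,
             json={"views": {"used_by": {"map": MAP_FN}}})

# List every document that references a given bank.
r = requests.get(f"{CLOUDANT_URL}/{DB}/_design/relations/_view/used_by",
                 params={"key": '"bank:M6FXX6UA5"'}, headers=AUTH)
print(r.json()["rows"])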

Find document and use its contents to update it in one atomic operation

I have to carry out certain multi-step operations on a set of documents in MongoDB. For instance, my documents each have two sub-documents in them like:
{
  "history": [],
  "preview": {
    "title": "This is the preview title",
    "subtitle": "This is a new field"
  },
  "live": {
    "title": "This is the live title"
  }
}
I have to find the document, do a diff between the preview and live states, push the diff into history and copy preview into live. How can I do a multi-step operation like this in MongoDB in an atomic way so the data doesn't change mid-operation?
The only thing I can think of is setting a flag on the document using findAndModify to prevent updates in between and then removing the flag with the final update call. Can anyone suggest a better solution?
I should point out that sometimes I need to publish multiple items in one go, so performance is important to me.
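For illustration, a minimal pymongo sketch of the flag-based approach described above might look like this; the locked field name, the collection names and the compute_diff helper are all assumptions, and each single-document update in MongoDB is applied atomically:

from pymongo import MongoClient, ReturnDocument

def publish(coll, doc_id, compute_diff):
    """Flag-based publish: compute_diff is a hypothetical callable(live, preview) -> diff."""
    # Step 1: atomically claim the document by setting a flag, but only if nobody else has.
    doc = coll.find_one_and_update(
        {"_id": doc_id, "locked": {"$exists": False}},
        {"$set": {"locked": True}},
        return_document=ReturnDocument.AFTER,
    )
    if doc is None:
        return False        # somebody else is publishing this document right now

    # Step 2: push the diff, copy preview over live and release the flag in one
    # single-document update, which MongoDB applies atomically.
    coll.update_one(
        {"_id": doc["_id"]},
        {
            "$push": {"history": compute_diff(doc["live"], doc["preview"])},
            "$set": {"live": doc["preview"]},
            "$unset": {"locked": ""},
        },
    )
    return True

# e.g. publish(MongoClient()["mydb"]["items"], some_id, my_diff_function)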

Normalized vs denormalized data in mongo

I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).
{
  "content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by #psawers",
  "author": {
    "username": "TheNextWeb",
    "id": "10876852",
    "name": "The Next Web",
    "photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
    "url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
    "description": "",
    "serviceName": "twitter"
  },
  "attachments": [
    {
      "title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
      "description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
      "url": "http:\/\/t.co\/tbsSrVYneK",
      "type": "link",
      "photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
    }
  ]
}
Posts are read often (we have a view with 4 tabs, and each tab requires 24 posts to be shown). Currently we are indexing these lists in Redis, so querying 4x24 posts is as simple as fetching the lists from Redis (which return lists of mongo ids) and querying posts with those ids.
Updates on the embedded author happen rarely (for example, when the author changes their picture). The updates do not have to be instantaneous or even fast.
We're wondering if we should split up the author and the post into two different collections, so a post would have a reference to its author instead of an embedded / duplicated author. Is a normalized data state preferred here (currently the author is duplicated for every post, resulting in a lot of duplicated data / extra bytes)? Or should we continue with the denormalized state?
As it seems that you have a few orders of magnitude more reads than writes, it probably makes little sense to split this data out into two collections. Especially with few updates, and with you needing almost all of the author information while showing posts, one query is going to be faster than two. You also get data locality, so you would potentially need less data in memory as well, which should provide another benefit.
However, you can only really find out by benchmarking this with the amount of data that you'd be using in production.
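For reference, the read path described in the question might look roughly like this in Python; the Redis list name, database and collection names are assumptions:

import redis
from bson import ObjectId
from pymongo import MongoClient

r = redis.Redis(decode_responses=True)          # Redis holds one list of post ids per tab
posts = MongoClient()["mydb"]["posts"]          # placeholder database / collection names

ids = [ObjectId(i) for i in r.lrange("tab:latest", 0, 23)]   # 24 ids for one tab
by_id = {d["_id"]: d for d in posts.find({"_id": {"$in": ids}})}
ordered = [by_id[i] for i in ids if i in by_id]              # preserve the Redis ordering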

Document design

I am trying out some different options to design and store a document structure in an efficient way in RavenDB.
The structure I am handling is user session and activity tracking information.
A Session is started when a User logs into the system and activities start getting created. There could be hundreds of activities per session.
The session ends when the user closes / logs out.
A factor that complicates the scenario slightly is that the sessions are displayed in a web portal in real time. In other words: I need to keep track of the session and activities and correlate them to be able to find out if they are ongoing (and how long they have been running) or if they are done.
You can also dig around in the history of course.
I did some research and found two relevant questions here on Stack Overflow, but neither of them really helped me:
Document structure for RavenDB
Activity stream design with RavenDb
The two options I have spiked successfully are: (simplified structures)
1:
{
  "User": "User1",
  "Machine": "machinename",
  "StartTime": "2012-02-13T13:11:52.0000000",
  "EndTime": "2012-02-13T13:13:54.0000000",
  "Activities": [
    {
      "Text": "Loaded Function X",
      "StartTime": "2012-02-13T13:12:10.0000000",
      "EndTime": "2012-02-13T13:12:10.0000000"
    },
    {
      "Text": "Executed action Z",
      "StartTime": "2012-02-13T13:12:10.0000000",
      "EndTime": "2012-02-13T13:12:10.0000000"
    }
  ]
}
2:
{
  "Session": "SomeSessionId-1",
  "User": "User1",
  "Machine": "machinename",
  "Text": "Loaded Function X",
  "StartTime": "2012-02-13T13:12:10.0000000",
  "EndTime": "2012-02-13T13:12:10.0000000"
}
{
  "Session": "SomeSessionId-1",
  "User": "User1",
  "Machine": "machinename",
  "Text": "Executed action Z",
  "StartTime": "2012-02-13T13:12:10.0000000",
  "EndTime": "2012-02-13T13:12:10.0000000"
}
Alternative 1 feels more natural coming from a relational background, and it was really simple to load up a Session, add events and store it away. However, the overhead of loading a Session object and then appending events every time feels really bad for insert performance.
Alternative 2 feels much more efficient: I can simply append events (almost like event sourcing). But the queries involved when digging around in events and showing them per session get a bit more complex.
Is there perhaps a third better alternative?
Could the solution be to separate the events and create another read model?
Am I overcomplicating the issue?
I definitely think you should go with some variant of option 2. Won't the documents grow very large in option 1? That would probably make the inserts very slow.
I can't really see why showing events per session would be any more complicated in option 2 than in option 1; you can just select events by session with
session.Query<Event>().Where(x => x.Session == sessionId)
and RavenDB will automatically create an index for it. If you want to make more complicated queries, you can always create more specialized indexes for that.
Looks like you just need a User document and a Session document. Create two models, "User" and "Session"; the session doc would have the user id as one property. The Session will also have nested "activity" properties. It will be easy to show real-time users, sessions and activities in this case. Without knowing more details, I'm oversimplifying, of course.
EDIT:
//Sample User Document
{
  "UserId": "ABC01",
  "HomeMachine": "xxxx",
  "DateCreated": "12/12/2011"
}
//Sample Session Document
{
  "UserId": "ABC01",
  "Activities": [
    { ...Activity 1 properties... },
    { ...Activity 2 properties... }
  ]
}