How do I track down slow queries in Cloudant? - ibm-cloud

I have some queries running against my Cloudant service. Some of them return quickly but a small minority are slower than expected. How can I see which queries are running slowly?

IBM Cloud activity logs can be sent to LogDNA Activity Tracker; each log item includes latency measurements, allowing you to identify which queries are running slower than others. For example, a typical log entry looks like this:
{
  "ts": "2021-11-30T22:39:58.620Z",
  "accountName": "xxxxx-yyyy-zzz-bluemix",
  "httpMethod": "POST",
  "httpRequest": "/yourdb/_find",
  "responseSizeBytes": 823,
  "clientIp": "169.76.71.72",
  "clientPort": 31393,
  "statusCode": 200,
  "terminationState": "----",
  "dbName": "yourdb",
  "dbRequest": "_find",
  "userAgent": "nodejs-cloudant/4.5.1 (Node.js v14.17.5)",
  "sslVersion": "TLSv1.2",
  "cipherSuite": "ECDHE-RSA-CHACHA20-POLY1305",
  "requestClass": "query",
  "parsedQueryString": null,
  "rawQueryString": null,
  "timings": {
    "connect": 0,
    "request": 1,
    "response": 2610,
    "transfer": 0
  },
  "meta": {},
  "logSourceCRN": "crn:v1:bluemix:public:cloudantnosqldb:us-south:a/abc12345:afdfsdff-dfdf34-6789-87yh-abcr45566::",
  "saveServiceCopy": false
}
The timings object contains various measurements, including the response time for the query.
For compliance reasons, the actual queries are not written to the logs. To match queries to log entries, you could put a unique identifier in the query string of the request; it would then appear in the rawQueryString field of the log entry.
For more information on logging see this blog post.
Another option is to simply measure HTTP round-trip latency.
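Putting those two ideas together, here is a minimal sketch (assuming Node.js 18+ with its built-in fetch; the service URL, API key, database name and the marker query parameter are all placeholders) that tags a _find request with a unique identifier, so it can be matched to the rawQueryString field of the log entry, and measures the client-side round-trip time:
// Sketch only: tag a _find request with a unique marker and time the round trip.
// SERVICE_URL, API_KEY and the database name are placeholders for your own values.
const crypto = require('crypto');

const SERVICE_URL = 'https://your-service.cloudantnosqldb.appdomain.cloud'; // placeholder
const API_KEY = 'your-iam-bearer-token';                                    // placeholder

async function timedFind(db, selector) {
  // Unique identifier in the query string; it shows up in rawQueryString in the
  // corresponding log entry, letting you match each query to its log line.
  const marker = crypto.randomUUID();

  const start = process.hrtime.bigint();
  const res = await fetch(`${SERVICE_URL}/${db}/_find?marker=${marker}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${API_KEY}`
    },
    body: JSON.stringify({ selector })
  });
  const body = await res.json();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;

  console.log(`marker=${marker} status=${res.status} round-trip=${ms.toFixed(1)}ms docs=${body.docs ? body.docs.length : 0}`);
  return body;
}

// Example usage:
// timedFind('yourdb', { name: 'test' }).catch(console.error);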
Once you have found your slow queries, have a look at this post for ideas on how to optimise queries.

Related

BulkDocs API used to save a CouchDB document is taking more time compared to the PUT method?

We checked in the Chrome network tab, which shows the timing of both API requests and responses.
Analysing those timings, the BulkDocs API takes roughly 2x as long to complete the document save in CouchDB; sometimes that grows to 3x or 4x, depending on how long the request waits for the server response.
At the same time, the PUT method takes about a quarter of the time to save the data, and this PUT request is called from another API. It looks like saving records using PUT requests is faster than using the BulkDocs API.
I have included the requests, responses, and screenshots below for reference.
BulkDocs Request:
{"docs":[{"_id":"pfm718215_2_BE1A8AC4-EB53-4C8E-B3F7-5D4FB4329963","data":{"pfm718093_1595329":null,"pfm_718215_id":null,"createdby":52803,"createdon":1665575674775,"lookupname":null,"lookupmail":null,"lastmodifiedby":52803,"lastmodifiedon":1665575674775,"guid":"Xj0JpEofDDy37Z2","name":"test","pfm_718093_1595327_id":null,"display_name":"pfm718215","couch_id":null,"couch_rev_id":null,"pfm718093_1595325":null,"pfm_718093_1595325_id":null,"pfm718093_1595327":null,"pfm_718093_1595329_id":null,"type":"pfm718215","sync_flag":"C","org_id":3}}],"new_edits":true}
BulkDocs Response :
[{
  "ok": true,
  "id": "pfm718215_2_BE1A8AC4-EB53-4C8E-B3F7-5D4FB4329963",
  "rev": "1-05f3e8e3e96844cb51a8143891b81d16"
}]
BulkDocs Timings Screenshot: (Chrome dev-tools screenshots of the Header, Request and Timings panels not reproduced here)
PUT Request :
{"webserviceInput":{"processInfo":{"orgId":3,"userId":52803},"dataParams":{"data":{"pfm718093_1595329":null,"pfm_718215_id":null,"createdby":52803,"createdon":1665569303482,"lookupname":null,"lookupmail":null,"lastmodifiedby":52803,"lastmodifiedon":1665569303482,"guid":"DRtlY2FlKAwVHBq","name":"test","pfm_718093_1595327_id":null,"display_name":"pfm718215","couch_id":null,"couch_rev_id":null,"pfm718093_1595325":null,"pfm_718093_1595325_id":null,"pfm718093_1595327":null,"pfm_718093_1595329_id":null,"type":"pfm718215","sync_flag":"C","org_id":3}},"sessionType":"NODEJS"}}
PUT Response :
{
  "ok": true,
  "id": "5f1eee08c843d01257c8b698d923fb02",
  "rev": "1-0f67c9b8c2acf7aead7e991e344b04df"
}
PUT Timings Screenshot: (Chrome dev-tools screenshots of the Header, Request and Timing panels not reproduced here)
CouchDB version details:
CouchDB 3.2.0 and {"erlang_version":"20.3.8.26","javascript_engine":{"name":"spidermonkey","version":"1.8.5"}}
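For anyone reproducing the comparison, a minimal client-side sketch of the two write paths might look like this (assuming Node.js 18+ with built-in fetch; the CouchDB URL, credentials, database and document IDs are placeholders, and the timings are round-trip as seen by the client, not server-side time):
// Sketch only: time the same document written via PUT /{db}/{docid} and via
// POST /{db}/_bulk_docs. URL, credentials, database and ids are placeholders.
const COUCH_URL = 'http://localhost:5984';                                 // placeholder
const AUTH = 'Basic ' + Buffer.from('admin:password').toString('base64'); // placeholder
const DB = 'yourdb';                                                       // placeholder

async function timed(label, fn) {
  const start = process.hrtime.bigint();
  const res = await fn();
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  console.log(`${label}: HTTP ${res.status} in ${ms.toFixed(1)} ms`);
}

async function main() {
  const doc = { name: 'test', sync_flag: 'C', org_id: 3 };
  const headers = { 'Content-Type': 'application/json', 'Authorization': AUTH };

  // Single document via PUT /{db}/{docid}
  await timed('PUT', () =>
    fetch(`${COUCH_URL}/${DB}/doc-put-1`, {
      method: 'PUT',
      headers,
      body: JSON.stringify(doc)
    }));

  // Same shape of document via POST /{db}/_bulk_docs
  await timed('_bulk_docs', () =>
    fetch(`${COUCH_URL}/${DB}/_bulk_docs`, {
      method: 'POST',
      headers,
      body: JSON.stringify({ docs: [{ _id: 'doc-bulk-1', ...doc }], new_edits: true })
    }));
}

main().catch(console.error);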

API provides pagination functionality - client wants to query beyond bounds of API

For a web project, I am consuming an API that returns Educational Materials (books, videos, etc.); put simply, you can request:
API accepted parameters :
type: accepts 1 or many: [book, video, software]
subject matter: accepts 1 or many: [science, math, english, history]
per page: accepts an integer, defaults to 2, 0 returns ALL results
page: accepts an integer, defaults to 1
Important: This is a contrived example of a real use case, so it's not just 1 or 2 requests I'd have to cache, it's almost an infinite amount of combinations.
and it returns objects that look like:
{
  "total-results": 15,
  "page": 1,
  "per-page": 2,
  "data": [
    {
      "title": "Foobar",
      "type": "book",
      "subject-matter": [
        "history",
        "science"
      ],
      "age": 10
    },
    {
      "title": "Barfoo",
      "type": "video",
      "subject-matter": [
        "history"
      ],
      "age": 14
    }
  ]
}
The client wants to be able to allow users to filter by age on my site -- so I have to essentially query everything and re-run my pagination.
I'd like to suggest to the API team (which we control) to allow me to query by age as well, but trying to explain this concept to the business is proving fruitless.
Right now all I can think of to solve this are two options: (1) convince the API team to allow me to query by age, or (2) cache the life out of my requests, use "0" by default, and handle pagination on my end.
Again, Important: This is a contrived example of a real use case, so it's not just 1 or 2 requests I'd have to cache, it's almost an infinite amount of combinations.
Anyone have experience dealing with something similar to this?
Edit: Eric Stein asked a very sound question, here is the Q & A:
His Q: "Your API team does not know how to filter by age?"
My A: "They may it's a HUGE organization and I may get stonewalled because of bureaucracy and want to prepare for the worst."
I worked on a project where we consumed an API and had to offer more filters than the API allowed (the API wasn't ours). In that case we decided to create a cron script that consumed the API and stored the returned data in a database of our own. We had a lot of problems maintaining that (it was A LOT of data), but it kinda worked for us (at least for the time I was working on the project).
I think if it's important to your application (and to your client) that you can filter by age, that's a pretty good argument to convince the API team to allow it.
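For what it's worth, the cron-and-cache approach described above could be sketched roughly like this (the endpoint, query parameter names and cache file are assumptions based on the contrived example in the question; Node.js 18+ with built-in fetch):
// Sketch only: periodically pull everything from the upstream API (per_page=0
// returns ALL results per the question) into a local cache, then filter by age
// and paginate locally. API_URL, parameter names and the cache file are made up.
const fs = require('fs');

const API_URL = 'https://example.com/materials'; // placeholder upstream API
const CACHE_FILE = 'materials-cache.json';

// Run this from a cron job / scheduled task.
async function refreshCache() {
  const res = await fetch(`${API_URL}?per_page=0`);
  const body = await res.json();
  fs.writeFileSync(CACHE_FILE, JSON.stringify(body.data));
}

// Age filtering and pagination are then handled against the local copy.
function filterByAge(minAge, maxAge, page = 1, perPage = 2) {
  const all = JSON.parse(fs.readFileSync(CACHE_FILE, 'utf8'));
  const matches = all.filter(m => m.age >= minAge && m.age <= maxAge);
  return {
    'total-results': matches.length,
    'page': page,
    'per-page': perPage,
    'data': matches.slice((page - 1) * perPage, page * perPage)
  };
}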

REST API sub resources, data to return?

If we have customers and orders, I'm looking for the correct RESTful way to get this data:
{
  "customer": {
    "id": 123,
    "name": "Jim Bloggs",
    "orders": [
      {
        "id": 123,
        "item": "Union Jack Keyring",
        "qty": 1
      },
      {
        "id": 987,
        "item": "London Eye Ticket",
        "qty": 5
      }
    ]
  }
}
GET /customers/123/orders
GET /customers/123?inc-orders=1
Am I correct that the last part/folder of the URL, excluding query string params, should be the resource returned..?
If so, number 1 should only return order data and not include the customer data. While number 2 is pointing directly at customer 123 and uses query string params to effect/filter the customer data returned, in this case including the order data.
Which of these two calls is the correct RESTful call for the above JSON..? ...or is there a secret number 3 option..?
You have 3 options which I think could be considered RESTful.
1)
GET /customers/123
But always include the orders. Do you have a situation in which the client would not want to use the orders? Or can the orders array get really big? If so you might want another option.
2)
GET /customers/123, which could include a link to their orders like so:
{
  "customer": {
    "id": 123,
    "name": "Jim Bloggs",
    "orders": {
      "href": "<link to your orders goes here>"
    }
  }
}
With this, your client would have to make two requests to get a customer and their orders. The good thing about this approach, though, is that you can easily implement clean paging and filtering on orders.
3)
GET /customers/123?fields=orders
This is similar to your second approach. This will allow clients to use your API more efficiently, but I wouldn't go this route unless you really need to limit the fields that are coming back from your server. Otherwise it will add unnecessary complexity to your API which you will have to maintain.
The Resource (identified by the complete URL) is the same, a customer. Only the Representation is different, with or without embedded orders.
Use Content Negotiation to get different Representations for the same Resource.
Request
GET /customers/123/
Accept: application/vnd.acme.customer.short+json
Response
200 OK
Content-Type: application/vnd.acme.customer.short+json
{
  "customer": {
    "id": 123,
    "name": "Jim Bloggs"
  }
}
Request
GET /customers/123/
Accept: application/vnd.acme.customer.full+json
Response
200 OK
Content-Type: application/vnd.acme.customer.full+json
{
  "customer": {
    "id": 123,
    "name": "Jim Bloggs",
    "orders": [
      {
        "id": 123,
        "item": "Union Jack Keyring",
        "qty": 1
      },
      {
        "id": 987,
        "item": "London Eye Ticket",
        "qty": 5
      }
    ]
  }
}
The JSON that you posted looks like what would be the result of
GET /customers/123
provided the Customer resource contains a collection of Orders as a property; alternatively you could either embed them, or provide a link to them.
The latter would result in something like this:
GET /customers/123/orders
which would return something like
{
  "orders": [
    {
      "id": 123,
      "item": "Union Jack Keyring",
      "qty": 1
    },
    {
      "id": 987,
      "item": "London Eye Ticket",
      "qty": 5
    }
  ]
}
I'm looking for the correct RESTful way to get this data
Simply perform an HTTP GET request on a URI that points to a resource that produces this data!
TL;DR
REST does not care about URI design, only about its constraints!
Clients perform state transitions through possible actions returned by the server through dynamically identified hyperlinks contained within the response.
Clients and servers can negotiate on a preferred hypermedia type
Instead of embedding the whole (sub-)resource consider only returning the link to that resource so a client can look it up if interested
First, REST does not really care about the URI design as long as the URI is unique. Sure, a simple URI design is easier for humans to understand; though, compared to HTML, the actual link can be hidden behind a more meaningful text, so it is not that important for humans either, as long as they are able to find the link and can perform an action against it. Next, why do you think your "response" or API is RESTful? To call an API RESTful, the API should respect a couple of constraints. Among these constraints is one that is quite buzzword-famous: hypertext as the engine of application state (HATEOAS).
REST is a generalized concept of the Web we use every day. A quite common task in a web session is that a client requests something and the server sends an HTML document with plenty of links and other resources the client can use to request further pages or stream a video (or whatever). A user operating a client can use the returned information to proceed further, request new pages, send information to the server, etc. The same holds true for RESTful applications. This is what REST simply defines as HATEOAS. If you now have a look at your "response" and double-check it against the HATEOAS constraint, you might see that your response does not contain any links to start with. A client therefore needs domain knowledge to proceed further.
JSON itself isn't the best hypermedia type IMO, as it only defines the overall syntax of the data but does not carry any semantics, similar to plain XML, which at least may have a DTD or schema that a client can use to validate the document and check whether further semantic rules are available elsewhere. There are a couple of hypermedia types built on top of JSON that are probably better suited, e.g. application/hal+json (a good comparison of JSON-based hypermedia types can be found in this blog post). You are of course entitled to define your own hypermedia type, though certain clients may not be able to understand it out of the box.
If you take a look at HAL, for example, you will see that it defines an _embedded element where you can put certain sub-resources. This seems ideal in your case. Depending on your design, orders could also be a resource of their own and thus be reachable via GET /orders/{orderId}. Instead of embedding the whole sub-resource, you can also just include the link to that (sub-)resource so a client can look up the data if interested.
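For illustration only, a HAL representation of the customer might look something like this (the link URIs are made up, and whether orders are embedded or merely linked is up to you):
{
  "id": 123,
  "name": "Jim Bloggs",
  "_links": {
    "self": { "href": "/customers/123" },
    "orders": { "href": "/customers/123/orders" }
  },
  "_embedded": {
    "orders": [
      {
        "id": 123,
        "item": "Union Jack Keyring",
        "qty": 1,
        "_links": { "self": { "href": "/orders/123" } }
      },
      {
        "id": 987,
        "item": "London Eye Ticket",
        "qty": 5,
        "_links": { "self": { "href": "/orders/987" } }
      }
    ]
  }
}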
If there are cases where you want to return only customer data and other cases where you also want to include order data, you can define different hypermedia types (based on HAL, for example) for both: one returning just the customer data, the other also including the order data. These types could be named something like application/vnd.yourcompanyname.version.customers.hal+json or application/vnd.yourcompanyname.version.customer_orders.hal+json. While this is certainly a development overhead compared to adding a simple query parameter to the request, the semantics are clearer and the documentation burden sits with the hypermedia type (or representation) rather than the HTTP operation.
You can of course also define some kind of view structure where one view only returns the customer data as is while a different view returns the customer data including the orders similar to a response I gave on a not so unrelated topic.

Normalized vs denormalized data in mongo

I have the following schema for posts. Each post has an embedded author and attachments (array of links / videos / photos etc).
{
  "content": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure http:\/\/t.co\/tbsSrVYneK by #psawers",
  "author": {
    "username": "TheNextWeb",
    "id": "10876852",
    "name": "The Next Web",
    "photo": "https:\/\/pbs.twimg.com\/profile_images\/378800000147133877\/895fa7d3daeed8d32b7c089d9b3e976e_bigger.png",
    "url": "https:\/\/twitter.com\/account\/redirect_by_id?id=10876852",
    "description": "",
    "serviceName": "twitter"
  },
  "attachments": [
    {
      "title": "Pixable tempts Everpix users with quick-import tool for photos ahead of December 15 closure",
      "description": "Pixable, the SingTel-owned company that organizes your social photos in smart ways, has announced a quick-import tool for Everpix users following the company's decision to close ...",
      "url": "http:\/\/t.co\/tbsSrVYneK",
      "type": "link",
      "photo": "http:\/\/cdn1.tnwcdn.com\/wp-content\/blogs.dir\/1\/files\/2013\/09\/camera1-.jpg"
    }
  ]
}
Posts are read often (we have a view with four tabs, and each tab needs to show 24 posts). Currently we index these lists in Redis, so querying 4x24 posts is as simple as fetching the lists from Redis (each returns a list of Mongo ids) and then querying the posts by those ids.
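For reference, that read path looks roughly like this (a sketch only, assuming ioredis and the official MongoDB Node.js driver; the Redis key scheme, database and collection names are illustrative):
// Sketch of the current read path: Redis holds an ordered list of Mongo ids
// per tab; the posts are then fetched by id. Key and collection names are made up.
const Redis = require('ioredis');
const { MongoClient, ObjectId } = require('mongodb');

async function postsForTab(tabKey) {
  const redis = new Redis();                                        // localhost:6379
  const client = await MongoClient.connect('mongodb://localhost:27017');
  try {
    const ids = await redis.lrange(`tab:${tabKey}`, 0, 23);         // 24 posts per tab
    const posts = await client.db('app').collection('posts')
      .find({ _id: { $in: ids.map(id => new ObjectId(id)) } })
      .toArray();
    // Preserve the ordering defined by the Redis list.
    const byId = new Map(posts.map(p => [p._id.toString(), p]));
    return ids.map(id => byId.get(id)).filter(Boolean);
  } finally {
    await client.close();
    redis.disconnect();
  }
}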
Updates on the embedded author happen rarely (for example when the author changes his picture). The updates do not have to be instantaneous or even fast.
We're wondering if we should split the author and the post into two different collections, so a post would hold a reference to its author instead of an embedded / duplicated author. Is a normalized data model preferred here, given that the author is currently duplicated for every post, resulting in a lot of duplicated data / extra bytes? Or should we continue with the denormalized state?
As it seems you have a few orders of magnitude more reads than writes, it probably makes little sense to split this data out into two collections. Especially with few updates, and given that you need almost all of the author information when showing posts, one query is going to be faster than two. You also get data locality, so you would potentially need less data in memory as well, which is another benefit.
However, you can only really find out by benchmarking this with the amount of data that you'd be using in production.
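As an illustration of why the rare author update is not a big deal with the embedded schema, here is a minimal sketch using the official MongoDB Node.js driver (the connection string, database and collection names are placeholders):
// Sketch only: the occasional "author changed their photo" update touches every
// post embedding that author. Slow, but acceptable for a rare, non-urgent write.
const { MongoClient } = require('mongodb');

async function updateAuthorPhoto(authorId, newPhotoUrl) {
  const client = await MongoClient.connect('mongodb://localhost:27017'); // placeholder
  try {
    await client.db('app').collection('posts').updateMany(
      { 'author.id': authorId },
      { $set: { 'author.photo': newPhotoUrl } }
    );
  } finally {
    await client.close();
  }
}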

Document design

I am trying out some different options to design and store a document structure in an efficient way in RavenDB.
The structure I am handling is user Session and activity tracking information.
A Session is started when a User logs into the system, and activities start being created from that point. There could be hundreds of activities per session.
The session ends when the user closes / logs out.
A factor that complicates the scenario slightly is that the sessions are displayed in a web portal in real time. In other words: I need to keep track of the session and activities and correlate them to be able to find out if they are ongoing (and how long they have been running) or if they are done.
You can also dig around in the history of course.
I did some research and found two relevant questions here on stack overflow but none of them really helped me:
Document structure for RavenDB
Activity stream design with RavenDb
The two options I have spiked successfully are: (simplified structures)
1:
{
  "User": "User1",
  "Machine": "machinename",
  "StartTime": "2012-02-13T13:11:52.0000000",
  "EndTime": "2012-02-13T13:13:54.0000000",
  "Activities": [
    {
      "Text": "Loaded Function X",
      "StartTime": "2012-02-13T13:12:10.0000000",
      "EndTime": "2012-02-13T13:12:10.0000000"
    },
    {
      "Text": "Executed action Z",
      "StartTime": "2012-02-13T13:12:10.0000000",
      "EndTime": "2012-02-13T13:12:10.0000000"
    }
  ]
}
2:
{
  "Session": "SomeSessionId-1",
  "User": "User1",
  "Machine": "machinename",
  "Text": "Loaded Function X",
  "StartTime": "2012-02-13T13:12:10.0000000",
  "EndTime": "2012-02-13T13:12:10.0000000"
}
{
  "Session": "SomeSessionId-1",
  "User": "User1",
  "Machine": "machinename",
  "Text": "Executed action Z",
  "StartTime": "2012-02-13T13:12:10.0000000",
  "EndTime": "2012-02-13T13:12:10.0000000"
}
Alternative 1 feels more natural coming from a relational background, and it was really simple to load up a Session, add events, and store it away. However, the overhead of loading a Session object and then appending events every time feels really bad for insert performance.
Alternative 2 feels much more efficient: I can simply append events (almost like event sourcing). But the queries when digging around in events and showing them per session get a bit more complex.
Is there perhaps a third better alternative?
Could the solution be to separate the events and create another read model?
Am I overcomplicating the issue?
I definitely think you should go with some variant of option 2. Won't the documents grow very large in option 1? That would probably make the inserts very slow.
I can't really see why showing events per session would be any more complicated in option 2 than in option 1; you can just select events by session with
session.Query<Event>().Where(x => x.Session == sessionId)
and RavenDB will automatically create an index for it. And if you want to make more complicated queries you could always create more specialized indexes for that.
Looks like you just need a User document and a Session document. Create two models, "User" and "Session"; the session doc would have the user id as one property. The Session will also have nested "activity" properties. It will be easy to show real-time users - sessions - activities in this case. Without knowing more details, I'm oversimplifying of course.
EDIT:
//Sample User Document
{
  UserId: "ABC01",
  HomeMachine: "xxxx",
  DateCreated: "12/12/2011"
}
//Sample Session Document
{
  UserId: "ABC01",
  Activities: [
    { Activity 1 properties },
    { Activity 2 properties },
    ...
  ]
}