Scroll vs (from+size) pagination vs search_after in stateless REST data sync APIs

I have an ES index which stores the unique key and last updated date for each document.
I need to write an API which will be used to sync the data related to this key, e.g. as a delta based on the stored date (give me data updated after 3rd Mar 2020).
Rough ES mapping:
{
  "mappings": {
    "userdata": {
      "_all": {
        "enabled": false
      },
      "properties": {
        "userId": {
          "type": "long"
        },
        "userUUID": {
          "type": "keyword"
        },
        "uniqueKey": {
          "type": "keyword"
        },
        "updatedTimestamp": {
          "type": "date"
        }
      }
    }
  }
}
I will use this ES index to find the list of such unique keys matching the date filter and build the remaining details for each key from Cassandra.
The API is stateless.
The number of documents matching the date filter could range from a few thousand to a few hundred thousand.
Now, when syncing such data, the client will need to paginate the results.
To paginate, I plan to use 'lastSynchedUniqueKey'. For each subsequent call, the client will provide this value and the API will internally perform a range query on this field, fetching the data with uniqueKey > lastSynchedUniqueKey.
So, the ES query will have the following components:
search query: (date range query) + (uniqueKey > lastSynchedUniqueKey) + (query on username)
sort: on uniqueKey in asc order
size: 100 --> this is the max pageSize (suggest whether it can be changed based on the total number of documents to be synced; my only concern is that I don't want to load the ES cluster with these queries, since there are other indices in the cluster which are used for user-facing searches)
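To make this concrete, here is a rough sketch of what I have in mind for such a request (the index name, the date value and the lastSynchedUniqueKey placeholder are illustrative; the username clause is omitted because that field is not part of the mapping shown above):
POST /<index>/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "range": { "updatedTimestamp": { "gte": "2020-03-03" } } },
        { "range": { "uniqueKey": { "gt": "<lastSynchedUniqueKey>" } } }
      ]
    }
  },
  "sort": [
    { "uniqueKey": "asc" }
  ]
}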
Which is the better option to perform pagination in this case?
pagination using (from + size) with the same filter and sort params: I know this will not be performant.
scroll: with the same filter and sort params
The ES documentation suggests sorting on '_doc' for scrolls, which is not possible in my case. Is it OK to use a field in the index instead?
Is scroll faster than search_after?
Please provide your inputs about sorting and pagination, both from the client's perspective and internally.
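For comparison, the search_after variant I am considering would keep the same filter and sort and pass the last returned uniqueKey as the cursor instead of a from offset; a minimal sketch (again with placeholder values):
POST /<index>/_search
{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        { "range": { "updatedTimestamp": { "gte": "2020-03-03" } } }
      ]
    }
  },
  "sort": [
    { "uniqueKey": "asc" }
  ],
  "search_after": ["<last uniqueKey from the previous page>"]
}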

Related

When doing an upsert to MongoDb is it possible to set a field with a timestamp only if other data in the record has changed?

We need to cache records for a service with a terrible API.
This service provides us with an API to query for data about our employees, but it does not tell us whether employees are new or have been updated, nor can we filter our queries by this information.
Our proposed solution to the problems this creates for us is to periodically (e.g. every 15 minutes) query all our employee data and upsert it into a Mongo database. Then, when we write to the MongoDb, we would like to include an additional property which indicates whether the record is new or whether the record has any changes since the last time it was upserted (obviously not including the field we are using for the timestamp).
The idea is, instead of querying the source directly, which we can't filter by such timestamps, we would instead query our cache which would include said timestamp and use it for a filter.
(Ideally, we'd like to write this in C# using the MongoDb driver, but more important right now is whether we can do this in an upsert call or whether we'd need to load all the records into memory, do comparisons, and then add the timestamps before upserting them....)
There might be a way of doing that, but how efficient it is remains to be seen. The update command in MongoDB can take an aggregation pipeline to perform the update operation. We can use the $addFields stage to add a new field denoting the update status, and $function to compute its value. A short example:
db.collection.update({
  key: 1
},
[
  {
    "$addFields": {
      changed: {
        "$function": {
          lang: "js",
          "args": [
            "$$ROOT",
            {
              "key": 1,
              data: "somedata"
            }
          ],
          "body": "function(originalDoc, newDoc) { return JSON.stringify(originalDoc) !== JSON.stringify(newDoc) }"
        }
      }
    }
  }
],
{
  upsert: true
})
Here's the playground link.
Some points to consider here, are:
If the order of fields in the old and new versions of the doc is not the same, the JSON.stringify comparison will report a difference even when the values are identical.
The function specified in $function runs on the server side, so ideally it should be lightweight. If a large number of users get upserted, it may or may not become a bottleneck.
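If that approach works for you, one possible way to tie it back to the original question (setting the timestamp only when the data changed) is to add a $set stage after the comparison. This is only a sketch: the lastModified field name is an assumption, and $$NOW and $cond are standard aggregation operators usable in update pipelines (MongoDB 4.2+).
// Sketch: recompute the "changed" flag, then conditionally bump an assumed lastModified field.
db.collection.update(
  { key: 1 },
  [
    {
      "$addFields": {
        changed: {
          "$function": {
            lang: "js",
            "args": [ "$$ROOT", { key: 1, data: "somedata" } ],
            "body": "function(originalDoc, newDoc) { return JSON.stringify(originalDoc) !== JSON.stringify(newDoc) }"
          }
        }
      }
    },
    {
      // only touch lastModified when the comparison reported a change
      "$set": {
        lastModified: { "$cond": [ "$changed", "$$NOW", "$lastModified" ] }
      }
    }
  ],
  { upsert: true }
)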

Inserting multiple key value pair data under single _id in cloudant db at various timings?

My requirement is to get JSON pairs from an MQTT subscriber at different timings under a single _id in Cloudant, but I'm facing an error while trying to insert a new JSON pair into an existing _id: it simply replaces the old one. I need at least 10 JSON pairs under one _id, injected at different timings.
First, you should make sure about your architectural decision to update a particular document multiple times. In general, this is discouraged, though it depends on your application. Instead, you could consider a way to insert each new piece of information as a separate document and then use a map-reduce view to reflect the state of your application.
For example (I'm going to assume that you have multiple "devices", each with some kind of unique identifier, that need to add data to a cloudant DB)
PUT
{
  "info_a": "data a",
  "device_id": 123
}
{
  "info_b": "data b",
  "device_id": 123
}
{
  "info_a": "message a",
  "device_id": 1234
}
Then you'll need a map function like
_design/device/_view/state
function (doc) {
  emit(doc.device_id, 1);
}
Then you can GET the results of that view to see all of the "info_X" data that is associated with the particular device.
GET account.cloudant.com/databasename/_design/device/_view/state
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1},
{"id":"eaa710a5fa1ff4ba6156c997ddf6099b","key":1234,"value":1}
]}
Then you can use the query parameters to control the output, for example
GET account.cloudant.com/databasename/_design/device/_view/state?key=123&include_docs=true
{"total_rows":3,"offset":0,"rows":[
{"id":"28324b34907981ba972937f53113ac3f","key":123,"value":1,"doc":
{"_id":"28324b34907981ba972937f53113ac3f",
"_rev":"1-bac5dd92a502cb984ea4db65eb41feec",
"info_b":"data b",
"device_id":123}
},
{"id":"d50553d206d722b960fb176f11841974","key":123,"value":1,"doc":
{"_id":"d50553d206d722b960fb176f11841974",
"_rev":"1-a2a6fea8704dfc0a0d26c3a7500ccc10",
"info_a":"data a",
"device_id":123}}
]}
And now you have the complete state for device_id:123.
Timing
Another issue is the rate at which you're updating your documents.
Bottom line recommendation is that if you are only updating the document once per ~minute or less frequently, then it could be reasonable for your application to update a single document. That is, you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database. You must make sure that you are providing the most recent _rev of that document, and you should also check for conflicts that could occur if the document is being updated by multiple devices.
If you are acquiring new data for a particular device at a high rate, you'll likely run into conflicts very frequently -- because cloudant is a distributed document store. In this case, you should follow something like the example I gave above.
Example flow for the second approach outlined by @gadamcox, for use cases where document updates are not required very frequently:
[...] you'd add new key-value pairs to the same document with the same _id value. In order to do that, however, you'll need to GET the full doc, add the new key-value pair, and then PUT that document back to the database.
Your application first fetches the existing document by id: (https://docs.cloudant.com/document.html#read)
GET /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1"]
}
Then your application updates the document in memory
{
"_id": "100",
"_rev": "1-2902191555...",
"No": ["1","2"]
}
and saves it in the database by specifying the _id and _rev (https://docs.cloudant.com/document.html#update)
PUT /$DATABASE/100
{
"_id": "100",
"_rev": "1-2902191555...",
"No":["1","2"]
}

What is the best way to store column oriented table in MongoDB for optimal query of data

I have a large table where the columns are user_id, user_feature_1, user_feature_2, ...., user_feature_n
So each row corresponds to a user and his or her features.
I stored this table in MongoDB by storing each column's values as an array, e.g.
{
'name': 'user_feature_1',
'values': [
15,
10,
...
]
}
I am using Meteor to pull data from MongoDB, and this way of storage facilitates fast and easy retrieval of the whole column's values for graph plotting.
However, this way of storing has a major drawback; I can't store arrays larger than 16mb.
There are a couple of possible solutions, but none of them seems good enough:
Store each column's values using GridFS. I am not sure if Meteor supports GridFS, and it lacks support for slicing the data, i.e. I may need to get just the top 1000 values of a column.
Store the table in row oriented format. E.g.
{
'user_id': 1,
'user_feature_1': 10,
'user_feature_2': 0.9,
....
'user_feature_n': 42
}
But I think this way of storing data is inefficient for querying a feature column's values
Or is MongoDB not suitable at all and SQL is the way to go? But Meteor does not support SQL.
Update 1:
I found this interesting article which explains why arrays in MongoDB are inefficient: https://www.mongosoup.de/blog-entry/Storing-Large-Lists-In-MongoDB.html
The following explanation is from http://bsonspec.org/spec.html:
Array - The document for an array is a normal BSON document with integer values for the keys, starting with 0 and continuing sequentially. For example, the array ['red', 'blue'] would be encoded as the document {'0': 'red', '1': 'blue'}. The keys must be in ascending numerical order.
This means that we can store at most 1 million values in a document, if the values and keys are of float type (16mb/128bits)
There is also a third option. A separate document for each user and feature:
{ u:"1", f:"user_feature_1", v:10 },
{ u:"1", f:"user_feature_2", v:11 },
{ u:"1", f:"user_feature_3", v:52 },
{ u:"2", f:"user_feature_1", v:4 },
{ u:"2", f:"user_feature_2", v:13 },
{ u:"2", f:"user_feature_3", v:12 },
You will have no document growth problems and you can query both "all values for user x" and "all values for feature x" without also accessing any unrelated data.
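As a rough illustration of that claim (the collection name and indexes below are assumptions, not something from the question):
// One document per (user, feature) pair, as in the example above; index both access paths.
// The collection name "user_features" is just an assumption for illustration.
db.user_features.createIndex({ u: 1 });
db.user_features.createIndex({ f: 1 });

// all values for user "1"
db.user_features.find({ u: "1" });

// all values for feature "user_feature_1", e.g. to plot one column
db.user_features.find({ f: "user_feature_1" }, { _id: 0, u: 1, v: 1 });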
16MB / 64bit float = 2,000,000 uncompressed datapoints. What kind of graph requires a minimum of 2 million points per column??? Instead try:
Saving a picture on an s3 server
Using a map-reduce solution like hadoop (probably your best bet)
Reducing numbers to small ints if they're currently floats
Computing the data on the fly, on the client (preferred, if possible)
Using a compression algo so you can save a subset & interpolate the rest
That said, a document-based DB would outperform a SQL DB in this use case, because a SQL DB would do exactly as Philipp suggested. Either way, you cannot send multiple 16MB files to a client; if the client doesn't leave you for poor UX, then you'll go broke on server costs :-).

How to create an index in MongoDB which calls a JS function via system.js?

I have two collections viz. whitelist (id, count, expiry) and blacklist (id).
Now I would like to create an index such that when count >= 200, a JS function is called which will remove the document from whitelist and add the id to blacklist.
So can I do this in Mongo using db.collection.createIndex({"count":1}, ???);
or do I need to write a daemon to scan the entire collection? Or is there any better method for the same?
You seem to be asking for what in a SQL relational database we would call a "trigger", which is something completely different from an "index" even in that world.
In the NoSQL world in general, and with MongoDB especially, that sort of "server logic" is relegated to the "client" code rather than the server. Think of it as another part of the "scalability" philosophy of these products, where certain features like "triggers" are taken away due to the stance that they "cost" a lot with distributed data.
So in order to do what you want, you do it in "code" instead of defining a database "trigger". The process is simple enough, via .findAndModify() and the wrapping variants available to language APIs:
// Increment when below 200 and return the modified document
var doc = db.whitelist.findAndModify({
  "query": { "_id": myId, "count": { "$lt": 200 } },
  "update": { "$inc": { "count": 1 } },
  "new": true
});

// Then remove from the blacklist where the value meets the condition
// (findAndModify returns null when nothing matched the query)
if ( doc !== null && doc.hasOwnProperty("count") ) {
  if ( doc.count >= 200 )
    db.blacklist.remove({ "_id": myId });
}
Be careful with the actual language API method variant, as the structure typically differs from the "query/update" keys as provided in the shell method.
The basic principles remain the same: modify and fetch, then remove from the other collection if your conditions are met. But it is "two" trips to the server, and there is no way to make the server "trigger" by itself when such a condition is met.
db.whitelist.insert(doc);
if (db.whitelist.find(criterion).count() >= 200) {
  var bulkRemove = db.whitelist.initializeUnorderedBulkOp();
  var bulkInsert = db.blacklist.initializeUnorderedBulkOp();
  db.whitelist.find(criterion).forEach(
    function(doc) {
      bulkInsert.insert({ _id: doc._id });
      bulkRemove.find({ _id: doc._id }).removeOne();
    }
  );
  bulkInsert.execute();
  bulkRemove.execute();
}
First, you insert the document as usual. Since criterion is going to use an index, the if clause should be evaluated quickly and efficiently.
In case we have 200 or more documents matching that criterion, we use bulk operations to insert the ids into the blacklist and remove the documents from the whitelist, each of which is executed as a single batch.
The problem with only writing the _id to the blacklist is that you need to check whether the criterion for being blacklisted is matched, so the _id needs to contain that criterion.
A better solution IMHO is to flag entries of a single collection using a field named blacklisted for individual entries, or to use the aggregation framework to find blacklisted documents and write them to a collection using the $out pipeline stage. Sadly, you didn't give example data or a proper description of your use case, so you get an unspecific answer.
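As a sketch of that aggregation variant (collection names taken from the question; note that $out replaces the whole target collection rather than appending to it):
// Everything that currently meets the blacklist criterion...
db.whitelist.aggregate([
  { "$match": { "count": { "$gte": 200 } } },
  // ...reduced to just the id...
  { "$project": { "_id": 1 } },
  // ...and written out; $out overwrites the "blacklist" collection with the result
  { "$out": "blacklist" }
])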

DynamoDB Model/Keys Advice

I was hoping someone could help me understand how to best design my table(s) for DynamoDb. I'm building an application which is used to track the visits a certain user makes to another user's profile.
Currently I have a MongoDB where one entry contains the following fields:
userId
visitedProfileId
date
status
isMobile
How would this translate to DynamoDB in a way that would not be too slow? I would need to do search queries to select all items that have a certain userId, taking the status and isMobile into account. What would my keys be? Can I use the limit functionality to only request the latest x entries (sorted on date)?
I really like the way DynamoDB can be used, but it really seems kind of complicated to make the jump between a regular NoSQL database and a key-value NoSQL database.
There are a couple of ways you could do this - and it probably depends on any other querying you may want to do on this table.
Make the HashKey of the table the userId, and then the RangeKey can be <status>:<isMobile>:<date> (e.g. active:true:2013-03-25T04:05:06.789Z). Then you can query using BEGINS_WITH in the RangeKeyCondition (and ScanIndexForward set to false to return results in descending order, i.e. newest first).
So let's say you wanted to find the 20 most recent rows for user ID 1234abcd that have a status of active and an isMobile of true (I'm guessing that's what you mean by "taking [them] into account"), then your query would look like:
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd" },
  "RangeKeyCondition": {
    "ComparisonOperator": "BEGINS_WITH",
    "AttributeValueList": [{ "S": "active:true:" }]
  },
  "ScanIndexForward": false
}
Another way would be to make the HashKey <userId>:<status>:<isMobile>, and the RangeKey would just be the date. You wouldn't need a RangeKeyCondition in this case (and in the example, the HashKeyValue would be { "S": "1234abcd:active:true" }).
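For completeness, a sketch of what the query for that second layout could look like, reusing the same (older-generation) request shape as the example above:
{
  "TableName": "Users",
  "Limit": 20,
  "HashKeyValue": { "S": "1234abcd:active:true" },
  "ScanIndexForward": false
}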