I have an app where a user can create tasks with configuration (scheduling, retries, etc.).
Each task is given a GUID and stored under this GUID as a key in a NoSQL DB (Couchbase).
Then I'm managing all these tasks and their status in an index document per account,
so the index looks like:
[
  {"id": "12345-98889-0000-1111", "status": "...", ...},
  {"id": "12345-98889-0000-2222", "status": "...", ...},
  {"id": "12345-98889-0000-3333", "status": "...", ...},
  {"id": "12345-98889-0000-4444", "status": "...", ...}
]
Every record in this JSON array points to a document with the full configuration of the task.
Anyhow, right now I have an account with an index document of 11 MB, which is getting pretty heavy to update: I have to load it, apply the update in memory, and then re-upload it.
Any suggestions on restructuring here?
I was thinking of using Elasticsearch as the index instead of managing a big JSON doc, but I'm not sure I want to introduce another component here...
Any reason not to use the N1QL query language? You are already storing the documents. Create GSI indexes on whatever fields you are interested in, and use N1QL queries to access the documents. The indexes should be used intelligently for efficient access. You just need to make sure your server is running the query service.
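For illustration, a minimal sketch of that approach using the Couchbase Node.js SDK (3.x); the bucket name (tasks), the field names (accountId, status), and the credentials are assumptions, not your actual schema:

// assumes the Couchbase Node.js SDK 3.x; all names are illustrative
const couchbase = require('couchbase');

async function listTasks(accountId) {
    const cluster = await couchbase.connect('couchbase://localhost', {
        username: 'user',
        password: 'pass',
    });
    // a GSI index such as
    //   CREATE INDEX idx_account_status ON `tasks`(accountId, status)
    // lets the query below avoid a full bucket scan
    const result = await cluster.query(
        'SELECT META(t).id, t.status FROM `tasks` t WHERE t.accountId = $accountId',
        { parameters: { accountId: accountId } }
    );
    return result.rows; // one row per task: its document key and status
}

The per-task documents then remain the single source of truth, and the 11 MB index document disappears entirely.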
I'm trying to build an Ionic application which retrieves data from Cloudant using PouchDB. Cloudant only allows creating databases and documents.
How can I create collections in Cloudant?
Two-part answer:
A set of documents that meet certain criteria can be considered a collection in Cloudant/CouchDB. You can create views to fetch those documents. Such a view might check for the existence of a property in a document ("all documents with a property named type"), the value of a property ("all documents with a property named type having the value of book") or any other condition that makes sense for your application and return the appropriate documents.
You basically have to follow a three-step process:
determine how you can identify documents in your database that you consider to be part of the collection
create a view based on your findings in the previous step
query the view to retrieve those documents
The Cloudant documentation on views provides more details.
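As an illustration, a minimal sketch of such a view's map function, assuming documents carry a type property with the value book (both names come from the example above, not from your data):

// map function of a Cloudant/CouchDB view; emits one row per "book"
function (doc) {
    if (doc.type === 'book') {
        emit(doc._id, null); // key the view by document id
    }
}

Querying the view (e.g. GET /mydb/_design/<ddoc>/_view/books) then returns exactly the documents in that "collection".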
Properties in your document can represent collections as well, as in the following example, which defines a simple array of strings.
{
  "mycollectionname": [
    "element1",
    "element2",
    ...
  ]
}
How you implement collections really depends on your use-case scenario.
Long post, but hope that helps.
I would like to explain this with an RDBMS analogy.
In any RDBMS, a new database would mean a different connection with a different set of credentials.
A collection would mean the set of tables in that particular database.
A record would mean a row in a table.
Similarly, you can look at a single Cloudant service instance as a database (RDBMS terminology).
A collection would be a "database" in that service instance in Cloudant's terminology.
A document would correspond to a single row.
Hence, Cloudant has no concept of a collection as such. If you need to store your related documents in separate collections, you must use multiple databases within the same service instance.
If you want to use only a single database, you could create a field like "record_type" to differentiate between the different kinds of documents, and use an index when querying them. For example: I have a student database, but I do not want to store the records for the Arts, Commerce, and Science branches in different databases. I will add a field "record_type": "arts", etc. to the records and create an index:
{ selector: {record_type: "arts"}}
Before doing any operation on the arts records, you can use this index to query the documents. In this way you can logically group your documents.
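Since the question mentions PouchDB, here is a minimal sketch of the same index-and-query flow using the pouchdb-find plugin; the database URL and the field names are illustrative:

// assumes PouchDB with the pouchdb-find plugin
const PouchDB = require('pouchdb');
PouchDB.plugin(require('pouchdb-find'));

const db = new PouchDB('https://<account>.cloudant.com/students');

async function artsRecords() {
    // the equivalent of the Cloudant Query index shown above
    await db.createIndex({ index: { fields: ['record_type'] } });
    const result = await db.find({ selector: { record_type: 'arts' } });
    return result.docs;
}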
I have a big collection of clients and a huge collection of client data. The collections are separate, and I don't want to combine them into a single collection (because of the other already-working servlets), but now I need to "join" data from both collections into a single result.
Since the query should return a large number of results, I don't want to query the server once and then use the result to query again. I'm also concerned about the traffic between the server and the DB and about the memory the result set will occupy in the server's RAM.
The way it works now is that I get the relevant client list from the 'clients' collection, send this list to the query on the 'client data' collection, and only then get the aggregated results.
I want to cut out fetching the client list and sending it right back to the DB; the database should ask itself, i.e. the query on the client-data collection should ask the clients collection for the relevant client list.
How can I use a stored procedure (a JavaScript function) to do the query in the DB and return only the relevant clients out of the collection?
Alternatively, is there a way to write a query that joins results from another collection?
"Good news everyone", this aggregation query work just fine in the mongo shell as a join query
db.clientData.aggregate([
    {
        $match: {
            id: {
                // the shell evaluates distinct() client-side first and
                // sends the resulting id list along with the pipeline
                $in: db.clients.distinct("_id", { "tag": "qa" })
            }
        }
    },
    {
        $group: {
            _id: "$computerId",
            total_usage: { $sum: "$workingTime" }
        }
    }
]);
The key idea with MongoDB data modelling is to be write-heavy, not read-heavy: do the work at write time rather than at read time, and store the data in the format that you need for reading, not in some format that minimizes or avoids redundancy (i.e. use a de-normalized data model).
I don't want to combine them into a single collection
That's not a good argument.
I'm also concerned about the traffic between the server and the DB [...]
If you need the data, you need the data. How does the way it is queried make a difference here?
[...] and the memory that the result set will occupy in the server RAM.
Is the amount of data so large that you want to stream it from the server to the client, so that it is transferred in chunks? How much data are we talking about, and why does the client read it all?
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection
There are no stored procedures in MongoDB, but you can use server-side map/reduce to 'join' collections. Generally, code that is stored in and run by the database violates the separation of concerns of a layered architecture. I consider it one of the ugliest hacks of all time, but that's debatable.
Also, and this is less debatable, keep in mind that M/R has huge overhead in MongoDB and is not geared towards real-time queries made, e.g., in a web server call. These calls will take hundreds of milliseconds.
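If you go down that road anyway, the classic hack is two map/reduce passes that reduce into the same output collection. A rough sketch, with all collection and field names (clients, clientData, clientId, workingTime) assumed from the question:

// pass 1: seed the output collection with one doc per relevant client
var mapClients = function () {
    if (this.tag === 'qa') emit(this._id, { tag: this.tag, workingTime: 0 });
};
// pass 2: fold the client-data values into the same keys
var mapData = function () {
    emit(this.clientId, { tag: null, workingTime: this.workingTime });
};
var reduce = function (key, values) {
    var merged = { tag: null, workingTime: 0 };
    values.forEach(function (v) {
        if (v.tag !== null) merged.tag = v.tag;
        merged.workingTime += v.workingTime;
    });
    return merged;
};
db.clients.mapReduce(mapClients, reduce, { out: { reduce: 'joined' } });
db.clientData.mapReduce(mapData, reduce, { out: { reduce: 'joined' } });
// db.joined now holds the merged ('joined') view, built server-side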
Is there a way to write a query that joins results from another collection?
No, operations are constrained to a single collection. You can, however, perform a second query and use the $in operator there, which is similar to a subselect and reasonably fast, but of course requires two round-trips.
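In mongo shell syntax, with the collection and field names taken from the question, the two round-trips look like this:

var ids = db.clients.distinct("_id", { tag: "qa" });  // round-trip 1
var docs = db.clientData.find({ id: { $in: ids } });  // round-trip 2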
How can I use a stored procedure to do the query in the DB and return only the relevant clients out of the collection?
There are no stored procedures in MongoDB.
Alternatively, is there a way to write a query that joins results from another collection?
You normally don't need to do any joins in MongoDB, and there is no such thing. The flexibility of the document model already covers the typical need for joins. You should think about your document model, and designing the joins out of your schema should always be your first port of call. As an alternative, you may need to use aggregation or map/reduce on the server side to handle this.
First of all, mnemosyn and Michael9 are right. But if I were in your shoes, also assuming that the client-data collection holds one document per client, I would store the document ID of the client-data document in the client document to make the "join" (still no joins in Mongo) easier.
If you have more client-data documents per client, then store an array of document IDs.
But none of this saves you from having to implement the "join" in your application code; if it's a Rails app, then probably in your controller.
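In mongo shell terms, the app-side "join" then boils down to something like this (the dataIds array field is the suggestion above, not an existing field):

var client = db.clients.findOne({ _id: clientId });
// one $in query instead of one query per client-data document
var data = db.clientData.find({ _id: { $in: client.dataIds } }).toArray();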
In an SQL database, if I wanted to access some sort of nested data, such as a list of tags or categories for each item in a table, I'd have to use some obscure form of joining in order to send the SQL query once and then only loop through the result cursor.
My question is, in a NoSQL database such as MongoDB, is it OK to query the database repeatedly such that I can do the previous task as follows:
var cursor = db.items.find();  // query for all items
cursor.forEach(function (item) {
    // one extra query per item (collection and field names illustrative)
    var tags = db.tags.find({ itemId: item._id }).toArray();
});
I know that I can store the tags in an array in the item's document, but I'm assuming that it is not always possible to store everything inside the same document. If that is the case, would it be expensive to query the database repeatedly, or is it designed to be used that way?
No. Neither in Mongo nor in any other database should you query the database in a loop. One good reason for this is performance: in most web apps, the database is the bottleneck, and developers try to make as few DB calls as possible, whereas here you are trying to make as many as possible.
In Mongo you can do what you want in many ways. Some of them are:
putting your tags inside the document: {itemName: 'item', tags: [1, 2, 3]}
if you know the list of elements, you do not need a loop to find information about them; you can fetch all the results in one query with $in: db.tags.find({ field: { $in: [<value1>, <value2>, ... <valueN>] } })
You should always try to fulfill a request with as few queries as possible. Keep in mind that each query, even when the database can answer it entirely from cache, requires a network round-trip between the application server and the database.
Even if you assume that both servers are in the same datacenter and the latency is only microseconds, these latencies add up when you query for a large number of documents.
Relational databases solve this issue with the JOIN command, but unfortunately MongoDB has no support for joins. For that reason, you should try to build your documents in a way that the most common queries can be answered by a single document. That means you should denormalize your data: when you have a 1:n relation, consider embedding the referencing documents as an array in the main document. Having redundancy in your database is usually not as unacceptable in MongoDB as it is in relational databases.
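A minimal sketch of that denormalized model, reusing the items/tags example from the question (field names are illustrative):

// tags embedded in the item document instead of a separate collection
db.items.insert({ name: "item1", tags: ["red", "large"] });
// matching a single element of an embedded array needs no join:
db.items.find({ tags: "red" });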
When you still have good reasons to keep the child-documents as separate documents, you should use a query with the $in operator to query them all at once, as Salvador Dali suggested in his answer.
It is not recommended to use Elasticsearch as the only storage, for some obvious reasons like security, transactions, etc. So how is it usually used together with another database?
Say I want to store some documents in MongoDB and be able to search effectively by some of their properties. What I'd do is store the full document in Mongo as usual and then trigger an insertion into Elasticsearch, but insert only the searchable properties plus the MongoDB ObjectID there. Then I can search using Elasticsearch, and having found the ObjectID, go to Mongo and fetch the whole document.
Is this a correct usage of Elasticsearch? I don't want to duplicate all the data, as I already have it in Mongo.
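Concretely, the flow I have in mind looks roughly like this (a sketch assuming the Node.js mongodb driver and a 7.x @elastic/elasticsearch client; index, collection, and field names are illustrative):

const { ObjectId } = require('mongodb');
const { Client } = require('@elastic/elasticsearch');

const es = new Client({ node: 'http://localhost:9200' });

// index only the searchable fields, keyed by the Mongo ObjectID
async function indexArticle(article) {
    await es.index({
        index: 'articles',
        id: article._id.toString(),
        body: { title: article.title, author: article.author },
    });
}

// search in ES, then fetch the full documents from Mongo by _id
async function search(collection, text) {
    const res = await es.search({
        index: 'articles',
        body: { query: { match: { title: text } } },
    });
    const ids = res.body.hits.hits.map(function (h) { return new ObjectId(h._id); });
    return collection.find({ _id: { $in: ids } }).toArray();
}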
For now, the best practice is to duplicate the documents in ES.
The cool thing here is that when you search, you don't have to go back to your database to fetch the content, as ES provides it in a single call.
You have everything in the ES search response to display the results to your user.
My 2 cents.
You may want to look at the MongoDB river plugin for Elasticsearch.
There are more issues here than just the size of the data you store or index: you might like to have MongoDB as a backup with "near real time" queries for inserted data, and as a queue for the data to be indexed (you may want to run MongoDB as a cluster with the write concern suited to your application).
I know that we can bulk-update documents in MongoDB with
db.collection.update( criteria, objNew, upsert, multi )
in one DB call, but it's homogeneous, i.e. all the affected documents match one set of criteria. But what I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...])
to send multiple update requests that would update possibly completely different documents, or classes of documents, in a single DB call.
What I want to do in my app is to insert/update a bunch of objects with a compound primary key; if the key already exists, update the object, otherwise insert it.
Can I combine all of this in one DB call in MongoDB?
Those are two separate questions. On the first one: there is no native MongoDB mechanism to bulk-send criteria/update pairs, although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support would be.
Checking for the existence of a document based on an embedded document (what you refer to as a compound key; in the interest of correct terminology, and to avoid confusion, it's better to use the Mongo name in this case) and inserting/updating depending on that existence check can be done with an upsert:
document A :
{
_id: ObjectId(...),
key: {
name: "Will",
age: 20
}
}
db.users.update({"key.name": "Will", "key.age": 20}, {$set: {"key.age": 21}}, true, false)
This upsert (an update that inserts if no document matches the criteria) will do one of two things, depending on the existence of document A:
Exists: performs the update $set: {"key.age": 21} on the existing document.
Doesn't exist: creates a new document with the criteria copied into it (an embedded "key" document with name "Will" and age 20) and then applies the update ($set: {"key.age": 21}). The end result is a document with key.name = "Will" and key.age = 21.
Hope that helps
We are seeing some benefit from the $in clause.
Our use case was to update the 'status' in the documents of a large number of records.
In our first cut, we were doing a for loop and updating them one by one. But then we switched to using an $in clause, and that made a huge improvement.
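In mongo shell terms the switch looks like this (collection name and status value are illustrative):

// one multi-document update instead of N single updates
db.records.update(
    { _id: { $in: idsToUpdate } },
    { $set: { status: "processed" } },
    false, // upsert
    true   // multi: apply the update to every matching document
);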
There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple updates cannot benefit from any of these optimizations. Each criterion means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen is that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time); this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead of sending the queries separately would be significant; Mongo's global write lock will be the limiting factor anyway.