Azure Cosmos DB Document Change Trigger With Old and new document? - triggers

I have a Cosmos DB trigger in Azure Functions and it fires when a document is changed, which is fine.
But my document is big and I only need the updated property in the trigger.
I could solve this by comparing the old and new documents, but in the trigger I only get the updated document.
So is there a way to get both the old and the updated document in the trigger?
My Azure Function trigger is:
module.exports = async function (context, documents) {
    if (!!documents && documents.length > 0) {
        context.log('Document Id: ', documents[0].id);
        context.log('Document: ', documents[0]);
    }
    context.done();
};
My function bindings are:
{
    "bindings": [
        {
            "type": "cosmosDBTrigger",
            "name": "documents",
            "direction": "in",
            "leaseCollectionName": "leases",
            "connectionStringSetting": "AzureWebJobsCosmosDBConnectionString",
            "databaseName": "ToDoList",
            "collectionName": "Items",
            "createLeaseCollectionIfNotExists": true
        }
    ],
    "disabled": false
}
Thanks in Advance

There is no way, at this point, to obtain the previous version or to just get the delta.
The Change Feed contains the operation and payload, not a reference to the previous state.

There is an open suggestion for this on Azure Feedback: https://feedback.azure.com
You might implement the pattern described there:
Store every version/change as a separate item.
Read the change feed to merge/consolidate changes and trigger appropriate actions downstream.
So basically you need another function with a change feed listener that stores every document in a separate "versions" collection. Afterwards you can add a Cosmos DB input binding that reads the latest version from that collection. A sketch of such a versioning function follows.
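A minimal sketch of that versioning function, assuming a second container named "ItemVersions", a separate lease collection, and reusing the connection setting from the question (these names and the per-version id scheme are assumptions, not part of the original setup).
function.json:
{
    "bindings": [
        {
            "type": "cosmosDBTrigger",
            "name": "documents",
            "direction": "in",
            "leaseCollectionName": "leases-versions",
            "connectionStringSetting": "AzureWebJobsCosmosDBConnectionString",
            "databaseName": "ToDoList",
            "collectionName": "Items",
            "createLeaseCollectionIfNotExists": true
        },
        {
            "type": "cosmosDB",
            "name": "versionDocuments",
            "direction": "out",
            "connectionStringSetting": "AzureWebJobsCosmosDBConnectionString",
            "databaseName": "ToDoList",
            "collectionName": "ItemVersions",
            "createIfNotExists": true
        }
    ]
}
index.js:
// copy every changed document into the version collection as a new item
module.exports = async function (context, documents) {
    if (documents && documents.length > 0) {
        context.bindings.versionDocuments = documents.map(doc => ({
            id: doc.id + '-' + doc._ts,   // one item per version (id scheme is an assumption)
            originalId: doc.id,
            snapshot: doc
        }));
    }
};
Any reader, including a Cosmos DB input binding on "ItemVersions", can then compare the current document with the most recent snapshot for the same originalId to work out which properties changed.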

Related

Cosmos DB with Robomongo: can't see document ids?

Can anyone tell me why, when I use Data Explorer for Cosmos DB, I get the following:
{
    "id": "d502b51a-e70a-40f1-9285-3861880b8d90",
    "Version": 1,
    ...
}
But when I use Robomongo I get:
{
    "Version" : 1,
    ...
}
minus the id?
Thanks
I tried to repro your scenario but it all worked correctly.
The Mongo document in Portal Data Explorer:
The Mongo document in Robo 3T:
They both have the id property.
Are you applying Projections on Robomongo / Robo 3T?
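For example, a projection like the following (the collection name is just a placeholder for illustration) would hide the id field even though it is stored:
db.yourCollection.find({}, { id: 0 })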
At this moment Cosmos DB handles the SQL API and the Mongo API separately, and each one has a different implementation: the SQL API uses JSON and the Mongo API uses BSON. You need to be aware of this while you are creating the document.
If you create the document with a BSON-based tool like Robo 3T, for example, you are going to get something like this:
{
    "_id": {
        "$oid": "5be0d98b9cdcce3c6ce0f6b8"
    },
    "name": "Name",
    "id": "5be0d98b9cdcce3c6ce0f6b8",
    ...
}
Instead, if you create your document with a JSON-based tool like Data Explorer, you are going to get this:
{
    "name": "Name",
    "id": "6c5c05b4-dfce-32a5-0779-e30821e6c510",
    ...
}
As you can see, the BSON-based API needs _id with a nested $oid to work correctly, while the JSON-based API only requires id. So you need to add the properties while you save the document (see below), or open it with the right tool; as Matias Quaranta recommends, use Azure Storage Explorer or even Data Explorer to handle both protocols properly.
Also, if you use your own code to create the document and you want to use the BSON format, you need to add the $oid; for example, in .NET Core it is something like this:
public bool TryGetMemberSerializationInfo(string memberName, out BsonSerializationInfo serializationInfo)
{
    switch (memberName)
    {
        case "Id":
            serializationInfo = new BsonSerializationInfo("_id", new ObjectIdSerializer(), typeof(ObjectId));
            return true;
        case "Name":
            serializationInfo = new BsonSerializationInfo("name", new StringSerializer(), typeof(string));
            return true;
        default:
            serializationInfo = null;
            return false;
    }
}
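The same idea can be sketched in JavaScript with the Node.js mongodb driver against the Mongo API (the connection string, database, and collection names here are assumptions): store both the BSON _id and a plain string id when saving, so JSON-based tools see an id field as well.
const { MongoClient, ObjectId } = require('mongodb');

async function insertWithBothIds(connectionString) {
    const client = await MongoClient.connect(connectionString);
    try {
        const items = client.db('ToDoList').collection('Items');
        const oid = new ObjectId();
        // keep the BSON _id for the Mongo API and a string id for JSON-based tools
        await items.insertOne({ _id: oid, id: oid.toHexString(), name: 'Name' });
    } finally {
        await client.close();
    }
}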

index Elasticsearch document with existing "id" field

I have documents that I want to index into Elasticsearch with an existing unique "id" field.
I get an array of documents from a REST API endpoint (e.g. http://some.url/api/products) in no particular order, and if a document with the _id already exists in Elasticsearch it should update and reindex the document.
I want to create a new document if no document with the _id exists in Elasticsearch, and update a document if it matches an existing document in Elasticsearch.
This could be done with:
PUT products/product/un1qu3-1d-b718-105973677e95
{
    "id": "un1qu3-1d-b718-105973677e95",
    "state": "packaged"
}
The basic idea is to use the provided "id" field to create or update a document. Extraction of _id from document fields seems to be deprecated (link). But the indexing/reindexing of documents with the "id" field can be done manually very easily with the Kibana dev tools, with Postman, or with a cURL request.
I want to achieve this (re-)indexing of documents that I receive over this api endpoint programmatically.
Is it possible to achieve this with Logstash or a simple cronjob? Does Elasticsearch provide any functionality for this? Or do I need to write some custom backend to achieve this?
I thought of either:
1) indexing the document into Elasticsearch with the "id" field of my document, or
2) finding an Elasticsearch query that first searches for the document with the specific "id" field and then updates the document.
I was unable to find a solution for either way and have no clue what a good approach would look like.
Can anyone point me into the right direction on how to achieve this, suggest a better approach or provide a solution?
Any help much appreciated!
Update
I solved the problem with the help of the accepted answer. I used Logstash, the Http_poller input plugin, this article: https://www.elastic.co/blog/new-way-to-ingest-part-1 and this elastic.co question: https://discuss.elastic.co/t/upsert-with-logstash/59116
My Logstash output looks like this at the moment:
output {
    elasticsearch {
        index => "products"
        document_type => "product"
        pipeline => "rename_id"
        document_id => "%{id}"
        doc_as_upsert => true
        action => "update"
    }
}
Update 2
Just for the sake of completeness, here is the "rename_id" pipeline:
{
    "rename_id": {
        "description": "_description",
        "processors": [
            {
                "set": {
                    "field": "_id",
                    "value": "{{id}}"
                }
            }
        ]
    }
}
It works this way!
Thanks a lot!
Peter,
If I understand correctly, you want to ingest your documents into Elasticsearch and will have some updates for these documents in the future?
If that's the case,
- Use your document's primary key as the id for the Elasticsearch documents.
- You can ingest the entire document with updated values; Elasticsearch will replace the previous document with the new one, given that the primary key is the same. The old document with the same id will be deleted.
We use this approach for our search data; a short example follows.
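For illustration, reusing the index and id from the question (the changed "state" value is just made up here): re-running an index request with the same id simply replaces the stored document.
PUT products/product/un1qu3-1d-b718-105973677e95
{
    "id": "un1qu3-1d-b718-105973677e95",
    "state": "shipped"
}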
You can use ingest pipelines to extract the id from the body and the _create endpoint to only create a document if it does not exist. Minor note: if you can specify the id on the client side, indexing will be faster, as adding a pipeline adds a certain overhead.
PUT _ingest/pipeline/my_pipeline
{
    "description": "_description",
    "processors": [
        {
            "set": {
                "field": "_id",
                "value": "{{id}}"
            }
        }
    ]
}
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}
GET twitter/tweet/123
# this second create will fail, because the pipeline maps it to the already existing _id 123
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
    "foo" : "bar",
    "id" : "123"
}
You can use a script to UPSERT (update or insert) your document:
POST /products/product/un1qu3-1d-b718-105973677e95/_update
{
    "script": {
        "inline": "ctx._source.state = \"packaged\"",
        "lang": "painless"
    },
    "upsert": {
        "id": "un1qu3-1d-b718-105973677e95",
        "state": "packaged"
    }
}
The above query finds the document with _id = "un1qu3-1d-b718-105973677e95".
If it finds one, it updates state to "packaged"; otherwise it creates a new document with the fields "id" and "state" (you can insert as many fields as you want).

Find DocumentId through Discovery GUI tool

I want to train my Discovery collection where I have already uploaded over 200 documents. I uploaded these documents through the GUI. Looking through the Discovery documentation, I know that I will have to make API calls to train my collection, since the training API has not been exposed through the GUI yet. As part of the training API calls I need to include a document that looks like this:
{
    "natural_language_query": "{natural_language_query}",
    "filter": "{filter_definition}",
    "examples": [
        {
            "document_id": "{document_id_1}",
            "cross_reference": "{cross_reference_1}",
            "relevance": 0
        },
        {
            "document_id": "{document_id_2}",
            "cross_reference": "{cross_reference_2}",
            "relevance": 0
        }
    ]
}
My question is how should I get the documentIds for the documents that I have already uploaded? Is there a way to find this through the GUI? Or perhaps an API call that will return something like:
{
    "document_name": "MyDocument1",
    "documentId": "the_document_id_for_MyDocument1"
},
...
{
    "document_name": "MyDocumentN",
    "documentId": "the_document_id_for_MyDocumentN"
}
Or would the only way to get the documentIds be to create a new collection, upload all of the documents through API calls directly, and track the documentIds as I get them back?
Using the GUI, perform the following steps:
1) Input term(_id) in the "Group query results (Aggregation)" textbox.
2) Under "Fields to return", select "Specify" and input extracted_metadata.
3) Note that the query and filter inputs should remain empty.
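For reference, the same lookup can also be issued against the Discovery v1 query API. This is only a rough sketch; the service URL, version date, and credentials are placeholders to fill in from your own service instance, and the parameter spelling should be checked against the API reference:
GET {url}/v1/environments/{environment_id}/collections/{collection_id}/query?version={version_date}&aggregation=term(_id)&return=extracted_metadata
The aggregation buckets in the response then list one key per document _id, and the returned extracted_metadata typically includes the original filename, which lets you map file names to document ids.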

How do I rename a nested key in mongodb

I want to rename my dict key in MongoDB.
Normally it works like this: db.update({'_id': id}, {$rename: {'oldfieldname': 'newfieldname'}})
My document structure looks like this:
{
    'data': '.....',
    'field': {'1': {'data': ....}, '2': {'data': ...}},
    'more_data': '....',
}
If I want to set a new field in field 1, I do db.update({'_id': id}, {$set: {'field.0.1.name': 'peter'}}); for field two it is 'field.1.2.name'.
I thought the rename would be similar, but it isn't ... (like $rename: {'field.0.1': 2})
Here's a flexible method for renaming keys in a database
Given a document structure like this...
{
    "_id": ObjectId("4ee5e9079b14f74ef14ddd2f"),
    "code": "130.4",
    "description": "4'' Socket Plug",
    "technicalData": {
        "Drawing No": "50",
        "length": "200mm",
        "diameter": "20mm"
    }
}
I want to loop through all documents and rename technicalData["Drawing No"] to technicalData["Drawing Number"]
Run the following JavaScript in the execute panel of (the excellent) RockMongo:
function remap(x) {
    var dNo = x.technicalData["Drawing No"];
    db.products.update({ "_id": x._id }, {
        $set: { "technicalData.Drawing Number": dNo },
        $unset: { "technicalData.Drawing No": 1 }
    });
}
db.products.find({ "technicalData.Drawing No": { $ne: null } }).forEach(remap);
The code will also run in a mongo shell
Your question is unclear, but it seems you'd like to rename a field name within an array.
The short answer is: you can't. As stated in the docs, $rename doesn't expand arrays to find a matching name; it only works on top-level fields.
What you can do to simulate a rename is copy the field and its data to the new name and then delete the original field. You might also need a way to account for potentially concurrent writes if you have a lot of writes to that object/field. A sketch of this follows.
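A minimal sketch of that copy-then-delete approach in the mongo shell, applied to the structure from the question (the collection name and the choice of renaming subkey '1' to '3' are assumptions):
db.mycollection.find({ 'field.1': { $ne: null } }).forEach(function (doc) {
    db.mycollection.update(
        { _id: doc._id },
        {
            $set: { 'field.3': doc.field['1'] },   // copy the data under the new key
            $unset: { 'field.1': 1 }               // then remove the old key
        }
    );
});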

Ways to implement data versioning in MongoDB

Can you share your thoughts on how you would implement data versioning in MongoDB? (I've asked a similar question regarding Cassandra. If you have any thoughts on which db is better for this, please share.)
Suppose that I need to version records in a simple address book. (Address book records are stored as flat JSON objects.) I expect that the history:
will be used infrequently
will be used all at once to present it in a "time machine" fashion
won't contain more than a few hundred versions for a single record.
history won't expire.
I'm considering the following approaches:
Create a new object collection to store the history of records or changes to the records. It would store one object per version with a reference to the address book entry. Such records would look as follows:
{
    '_id': 'new id',
    'user': user_id,
    'timestamp': timestamp,
    'address_book_id': 'id of the address book record',
    'old_record': {'first_name': 'Jon', 'last_name': 'Doe', ...}
}
This approach can be modified to store an array of versions per document, but that seems to be a slower approach without any advantages.
Store versions as serialized (JSON) objects attached to address book entries. I'm not sure how to attach such objects to MongoDB documents. Perhaps as an array of strings.
(Modelled after Simple Document Versioning with CouchDB)
The first big question when diving into this is "how do you want to store changesets"?
Diffs?
Whole record copies?
My personal approach would be to store diffs. Because the display of these diffs is really a special action, I would put the diffs in a different "history" collection.
I would use the different collection to save memory space. You generally don't want a full history for a simple query. So by keeping the history out of the object you can also keep it out of the commonly accessed memory when that data is queried.
To make my life easy, I would make a history document contain a dictionary of time-stamped diffs. Something like this:
{
    _id : "id of address book record",
    changes : {
        1234567 : { "city" : "Omaha", "state" : "Nebraska" },
        1234568 : { "city" : "Kansas City", "state" : "Missouri" }
    }
}
To make my life really easy, I would make this part of my DataObjects (EntityWrapper, whatever) that I use to access my data. Generally these objects have some form of history, so that you can easily override the save() method to make this change at the same time.
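A minimal sketch of that save() override with the Node.js mongodb driver; the collection names, the shallow-diff logic, and the choice to store the previous values as the diff are all assumptions of this sketch, not part of the answer above.
// save the new record and record a time-stamped diff of the fields that changed
async function saveWithHistory(db, newDoc) {
    const books = db.collection('addressbook');
    const history = db.collection('addressbook_history');

    const oldDoc = await books.findOne({ _id: newDoc._id });

    // shallow diff: remember the previous value of every field that changed
    const diff = {};
    for (const key of Object.keys(newDoc)) {
        if (key === '_id') continue;
        const before = oldDoc ? oldDoc[key] : undefined;
        if (JSON.stringify(before) !== JSON.stringify(newDoc[key])) {
            diff[key] = before === undefined ? null : before;
        }
    }

    await books.replaceOne({ _id: newDoc._id }, newDoc, { upsert: true });

    if (Object.keys(diff).length > 0) {
        await history.updateOne(
            { _id: newDoc._id },
            { $set: { ['changes.' + Date.now()]: diff } },
            { upsert: true }
        );
    }
}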
UPDATE: 2015-10
It looks like there is now a spec for handling JSON diffs. This seems like a more robust way to store the diffs / changes.
There is a versioning scheme called "Vermongo" which addresses some aspects which haven't been dealt with in the other replies.
One of these issues is concurrent updates, another one is deleting documents.
Vermongo stores complete document copies in a shadow collection. For some use cases this might cause too much overhead, but I think it also simplifies many things.
https://github.com/thiloplanz/v7files/wiki/Vermongo
Here's another solution using a single document for the current version and all old versions:
{
_id: ObjectId("..."),
data: [
{ vid: 1, content: "foo" },
{ vid: 2, content: "bar" }
]
}
data contains all versions. The data array is ordered; new versions will only get $pushed to the end of the array. data.vid is the version id, which is an incrementing number.
Get the most recent version:
find(
    { "_id": ObjectId("...") },
    { "data": { $slice: -1 } }
)
Get a specific version by vid:
find(
    { "_id": ObjectId("...") },
    { "data": { $elemMatch: { "vid": 1 } } }
)
Return only specified fields:
find(
    { "_id": ObjectId("...") },
    { "data": { $elemMatch: { "vid": 1 } }, "data.content": 1 }
)
Insert new version: (and prevent concurrent insert/update)
update(
    {
        "_id": ObjectId("..."),
        $and: [
            { "data.vid": { $not: { $gt: 2 } } },
            { "data.vid": 2 }
        ]
    },
    { $push: { "data": { "vid": 3, "content": "baz" } } }
)
2 is the vid of the current most recent version and 3 is the new version being inserted. Because you need the most recent version's vid, it's easy to get the next version's vid: nextVID = oldVID + 1.
The $and condition will ensure that 2 is the latest vid.
This way there's no need for a unique index, but the application logic has to take care of incrementing the vid on insert, as sketched below.
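A minimal sketch of that application logic with the Node.js mongodb driver; the function name and the coll/docId/newContent parameters are assumptions used only for illustration.
// read the latest vid, then push the next version, guarded against concurrent writers
async function pushNextVersion(coll, docId, newContent) {
    // read only the most recent array element to learn the current vid
    const latest = await coll.findOne(
        { _id: docId },
        { projection: { data: { $slice: -1 } } }
    );
    const latestVid = latest.data[latest.data.length - 1].vid;

    // the filter mirrors the update above: it only matches if latestVid is still the newest vid
    const res = await coll.updateOne(
        {
            _id: docId,
            $and: [
                { "data.vid": { $not: { $gt: latestVid } } },
                { "data.vid": latestVid }
            ]
        },
        { $push: { data: { vid: latestVid + 1, content: newContent } } }
    );
    return res.modifiedCount === 1;   // false means a concurrent writer won; re-read and retry
}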
Remove a specific version:
update(
    { "_id": ObjectId("...") },
    { $pull: { "data": { "vid": 2 } } }
)
That's it!
(remember the 16MB per document limit)
If you're looking for a ready-to-roll solution -
Mongoid has built-in simple versioning:
http://mongoid.org/en/mongoid/docs/extras.html#versioning
mongoid-history is a Ruby plugin that provides a significantly more complicated solution with auditing, undo and redo
https://github.com/aq1018/mongoid-history
I worked through this solution, which accommodates published, draft, and historical versions of the data:
{
    published: {},
    draft: {},
    history: {
        "1": {
            metadata: <value>,
            document: {}
        },
        ...
    }
}
I explain the model further here: http://software.danielwatrous.com/representing-revision-data-in-mongodb/
For those that may implement something like this in Java, here's an example:
http://software.danielwatrous.com/using-java-to-work-with-versioned-data/
It includes all the code, which you can fork if you like:
https://github.com/dwatrous/mongodb-revision-objects
If you are using Mongoose, I have found the following plugin to be a useful implementation of the JSON Patch format:
mongoose-patch-history
Another option is to use the mongoose-history plugin.
let mongoose = require('mongoose');
let mongooseHistory = require('mongoose-history');
let Schema = mongoose.Schema;

let MySchema = new Schema({
    title: String,
    status: Boolean
});
MySchema.plugin(mongooseHistory);
// The plugin will automatically create a new collection named after the schema + "_history".
// In this case, a collection named "my_schema_history" will be created.
I have used the package below for a Meteor/MongoDB project and it works well. The main advantage is that it stores history/revisions within an array in the same document, hence there is no need for additional publications or middleware to access the change history. It can support a limited number of previous versions (e.g. the last ten versions), and it also supports change concatenation (so all changes that happened within a specific period will be covered by one revision).
nicklozon/meteor-collection-revisions
Another sound option is to use Meteor Vermongo (here)