I am looking for a way to update data in Elasticsearch with Go. There are more than 1,000,000 documents, and each update must target a specific document ID. In MongoDB I can do this with a bulk operation, but I can't find an equivalent in Elasticsearch. Does it have an operation like that? Or does anyone have an idea for updating a large amount of data in Elasticsearch by specific ID? Thanks in advance.
In general, you can use the bulk API for such updates. You can either index the data again using the same ID or just run an update. If this is a one-off update, you can use curl to push the updates from the command line.
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
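As a rough illustration of the request body above, the following sketch builds the newline-delimited JSON for a batch of partial updates keyed by document ID. The helper names `build_bulk_update_body` and `chunked` are hypothetical, and splitting the million-plus updates into batches is an assumption about how one would keep each request to a reasonable size. The same body layout applies from any client, including Go.

```python
import json


def build_bulk_update_body(index, updates):
    """Return an NDJSON string for the _bulk endpoint.

    `updates` maps a document ID to the partial document to merge into it.
    """
    lines = []
    for doc_id, partial_doc in updates.items():
        # Action line: which document to update.
        lines.append(json.dumps({"update": {"_index": index, "_id": doc_id}}))
        # Source line: the fields to merge into the existing document.
        lines.append(json.dumps({"doc": partial_doc}))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


def chunked(items, size):
    """Split a list into batches of at most `size` items (hypothetical helper)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


body = build_bulk_update_body("test", {"1": {"field2": "value2"}})
print(body, end="")
```

Each batch would then be sent as the body of one `POST _bulk` request.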
Another option is update_by_query, if you are setting custom fields. You can also combine update by query with a pipeline to update existing data.
It comes down to whether you are trying to run the update using information from a different index (in that case you can use the enrich processor, which is available from 7.5 onwards) or you simply want to add a new field and populate it using a rule based on attributes already available on the document.
So different options suit different scenarios. The bulk API is more appropriate when the data source is external. If the data is already in Elasticsearch, update by query is appropriate.
You can also look at reindexing with pipeline scripting. But again, the horses-for-courses rule applies here as well.
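For the update-by-query option, a minimal sketch of the request body might look like this. The field names `field1` and `field2` are borrowed from the bulk example, the term query is only an illustrative selection rule, and the helper name is hypothetical.

```python
import json


def build_update_by_query_body(match_field, match_value, new_value):
    """Build a body for POST <index>/_update_by_query (illustrative only)."""
    return {
        # Painless script applied to every document the query selects.
        "script": {
            "lang": "painless",
            "source": "ctx._source.field2 = params.value",
            "params": {"value": new_value},
        },
        # Which documents to touch; a term query is assumed here.
        "query": {"term": {match_field: match_value}},
    }


body = build_update_by_query_body("field1", "value1", "value2")
print(json.dumps(body, indent=2))
```

Passing the new value through `params` rather than embedding it in the script source lets the script be compiled once and reused.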
Related
I have stored the following document in my Cosmos DB using the Mongo API:
{
"_id" : ObjectId("59157eaabfeb1900011592c8"),
"imageResourceId" : "1489496086018.png",
"gallery" : "Tst",
"thumbnailRaw" : {
"$binary" : "<SNIP>",
"$type" : "00"
},
"tags" : [
"Weapon/Sword",
"Japanese"
],
"__v" : 1
}
I'm trying to perform a query that excludes any objects containing the tag "Japanese". I've crafted the following query, which performs correctly (that is, it does not return the above document) on a real Mongo DB:
{"gallery":"Tst", "tags":{"$nin":["Japanese"]}}
On Cosmos DB, this query returns the above image, despite the presence of a string found in the $nin array. Am I doing this query correctly? Is there another, supported way for Cosmos DB to do a NOT IN logical operation?
I have a different issue with Cosmos DB that made me run a few tests on array operations. I believe that in your case this should work:
db.gallery.find({"tags":{"$elemMatch":{$nin: ["Japanese"]}}} )
my issue:
Azure Cosmos DB check if array in field is contained in search array
I agree with the comment that CosmosDB is implementing only a subset of mongoDB, and documentation is very scarce, but I hope the fix I propose works for you.
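To make the difference between the two query shapes concrete, here is a small sketch that emulates how MongoDB itself evaluates `$nin` and `$elemMatch` with `$nin` on an array field. It is a simplification (it ignores whole-array equality and regular expressions), the function names are hypothetical, and it says nothing about what Cosmos DB actually does. Under MongoDB's semantics, the `$elemMatch` form matches as soon as one element is outside the list, so it would still return the sample document because of "Weapon/Sword".

```python
def matches_nin(doc, field, excluded):
    """Emulate {field: {"$nin": excluded}} for a scalar or array field."""
    if field not in doc:
        return True  # $nin also matches documents that lack the field
    value = doc[field]
    values = value if isinstance(value, list) else [value]
    # For arrays, no element may appear in the excluded list.
    return not any(v in excluded for v in values)


def matches_elemmatch_nin(doc, field, excluded):
    """Emulate {field: {"$elemMatch": {"$nin": excluded}}}."""
    value = doc.get(field)
    if not isinstance(value, list):
        return False  # $elemMatch only matches array fields
    # At least one element must be outside the excluded list.
    return any(v not in excluded for v in value)


doc = {"gallery": "Tst", "tags": ["Weapon/Sword", "Japanese"]}
print(matches_nin(doc, "tags", ["Japanese"]))            # False
print(matches_elemmatch_nin(doc, "tags", ["Japanese"]))  # True
```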
I removed some documents by mistake with my last query. Is there any way to roll back my last query on a MongoDB collection?
Here is my last query:
db.foo.remove({ "name" : "some_x_name"})
Is there any rollback/undo option? Can I get my data back?
There is no rollback option (rollback has a different meaning in a MongoDB context), and strictly speaking there is no supported way to get these documents back - the precautions you can/should take are covered in the comments. With that said however, if you are running a replica set, even a single node replica set, then you have an oplog. With an oplog that covers when the documents were inserted, you may be able to recover them.
The easiest way to illustrate this is with an example. I will use a simplified example with just 100 deleted documents that need to be restored. To go beyond this (huge number of documents, or perhaps you wish to only selectively restore etc.) you will either want to change the code to iterate over a cursor or write this using your language of choice outside the MongoDB shell. The basic logic remains the same.
First, let's create our example collection foo in the database dropTest. We will insert 100 documents without a name field and 100 documents with an identical name field so that they can be mistakenly removed later:
use dropTest;
for(i=0; i < 100; i++){db.foo.insert({_id : i})};
for(i=100; i < 200; i++){db.foo.insert({_id : i, name : "some_x_name"})};
Now, let's simulate the accidental removal of our 100 name documents:
> db.foo.remove({ "name" : "some_x_name"})
WriteResult({ "nRemoved" : 100 })
Because we are running in a replica set, we still have a record of these documents being inserted in the oplog, and thankfully those inserts have not (yet) fallen off the end of the oplog (remember, the oplog is a capped collection). Let's see if we can find them:
use local;
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}).count();
100
The count looks correct, we seem to have our documents still. I know from experience that the only piece of the oplog entry we will need here is the o field, so let's add a projection to only return that (output snipped for brevity, but you get the idea):
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1});
{ "o" : { "_id" : 100, "name" : "some_x_name" } }
{ "o" : { "_id" : 101, "name" : "some_x_name" } }
{ "o" : { "_id" : 102, "name" : "some_x_name" } }
{ "o" : { "_id" : 103, "name" : "some_x_name" } }
{ "o" : { "_id" : 104, "name" : "some_x_name" } }
To re-insert those documents, we can just store them in an array, then iterate over the array and insert the relevant pieces. First, let's create our array:
var deletedDocs = db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1}).toArray();
> deletedDocs.length
100
Next we remind ourselves that we only have 100 docs in the collection now, then loop over the 100 inserts, and finally revalidate our counts:
use dropTest;
db.foo.count();
100
// simple for loop to re-insert the relevant elements
for (var i = 0; i < deletedDocs.length; i++) {
db.foo.insert({_id : deletedDocs[i].o._id, name : deletedDocs[i].o.name});
}
// check total and name counts again
db.foo.count();
200
db.foo.count({name : "some_x_name"})
100
And there you have it, with some caveats:
This is not meant to be a true restoration strategy; for that, look at backups (MMS, other) and delayed secondaries, as mentioned in the comments
It's not going to be particularly quick to query the documents out of the oplog (any oplog query is a table scan) on a large busy system.
The documents may age out of the oplog at any time (you can, of course, make a copy of the oplog for later use to give you more time)
Depending on your workload you might have to de-dupe the results before re-inserting them
Larger sets of documents will be too large for an array as demonstrated, so you will need to iterate over a cursor instead
The format of the oplog is considered internal and may change at any time (without notice), so use at your own risk
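The recovery logic, including the de-duplication caveat, can be sketched outside the shell as follows. This runs against a plain in-memory list that stands in for `local.oplog.rs`; it assumes only the `op`, `ns` and `o` fields shown above, and the function name and sample entries are hypothetical.

```python
def recover_inserts(oplog_entries, namespace, predicate):
    """Return the last inserted version of each matching document."""
    recovered = {}
    for entry in oplog_entries:  # assumed to be in oplog (chronological) order
        if entry.get("op") != "i" or entry.get("ns") != namespace:
            continue  # skip deletes, updates and other collections
        doc = entry["o"]
        if predicate(doc):
            recovered[doc["_id"]] = doc  # a later insert of the same _id wins
    return list(recovered.values())


# Stand-in oplog mirroring the example: ids 100..199 inserted, plus one
# duplicate insert and one delete entry that must both be ignored.
oplog = [
    {"op": "i", "ns": "dropTest.foo", "o": {"_id": i, "name": "some_x_name"}}
    for i in range(100, 200)
]
oplog.append({"op": "i", "ns": "dropTest.foo", "o": {"_id": 100, "name": "some_x_name"}})
oplog.append({"op": "d", "ns": "dropTest.foo", "o": {"_id": 100}})

docs = recover_inserts(oplog, "dropTest.foo", lambda d: d.get("name") == "some_x_name")
print(len(docs))  # 100
```

With a real driver, the loop would read from a cursor rather than a list, which also addresses the caveat about result sets too large for an array.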
Although this question is a bit old, I wanted to share something I researched in this area that may be useful to others with a similar problem.
The fact is that MongoDB does not physically delete data immediately; it only marks it for deletion. This is, however, version specific, and there is currently no documentation or standardization that would enable a third-party tool developer (or someone in desperate need) to build a tool or write a simple script that works reliably across versions. I opened a ticket for this: https://jira.mongodb.org/browse/DOCS-5151.
I did explore one option which is at a much lower level and may need fine-tuning based on the version of MongoDB used. It is understandably too low level for most people's liking, but it works and can be handy when all else fails.
My approach involves directly working with the binary in the file and using a Python script (or commands) to identify, read and unpack (BSON) the deleted data.
My approach is inspired by this GitHub project (I am NOT the developer of this project). Here on my blog I have tried to simplify the script and extract a specific deleted record from a Raw MongoDB file.
Currently a record is marked for deletion as "\xee" at the start of the record. This is what a deleted record looks like in the raw db file,
'\xee\xee\xee\xee\x07_id\x00U\x19\xa6g\x9f\xdf\x19\xc1\xads\xdb\xa8\x02name\x00\x04\x00\x00\x00AAA\x00\x01marks\x00\x00\x00\x00\x00\x00@\x9f@\x00'
I replaced the first block with the size of the record which I identified earlier based on other records.
y = "3\x00\x00\x00" + x[20804:20800+51]
Finally using the BSON package (that comes with pymongo), I decoded the binary to a Readable object.
bson.decode_all(y)
[{u'_id': ObjectId('5519a6679fdf19c1ad73dba8'), u'name': u'AAA', u'marks': 2000.0}]
This BSON is a python object now and can be dumped into a recover collection or simply logged somewhere.
Needless to say, this or any other recovery technique should ideally be done in a staging area on a backup copy of the database file.
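For readers without pymongo at hand, the same record can be decoded with only the standard library. This is a minimal sketch, not a general BSON parser: it handles just the three element types present in this record (double, string, ObjectId). It assumes the two bytes around `\x9f` in the `marks` value are 0x40 ("@"), which is what the decoded value 2000.0 requires, and that the record is exactly 51 bytes long as stated above.

```python
import struct

# The deleted record as it appears in the raw file (51 bytes).
raw = (b"\xee\xee\xee\xee\x07_id\x00U\x19\xa6g\x9f\xdf\x19\xc1\xads\xdb\xa8"
       b"\x02name\x00\x04\x00\x00\x00AAA\x00"
       b"\x01marks\x00\x00\x00\x00\x00\x00@\x9f@\x00")


def decode_document(data):
    """Decode a BSON document containing only double, string and ObjectId."""
    (size,) = struct.unpack_from("<i", data, 0)
    pos, end = 4, size - 1  # the last byte is the document terminator
    doc = {}
    while pos < end:
        type_byte = data[pos]
        pos += 1
        name_end = data.index(b"\x00", pos)  # element names are NUL-terminated
        key = data[pos:name_end].decode("utf-8")
        pos = name_end + 1
        if type_byte == 0x01:    # 64-bit float, little-endian
            (doc[key],) = struct.unpack_from("<d", data, pos)
            pos += 8
        elif type_byte == 0x02:  # int32 length, then UTF-8 bytes plus NUL
            (length,) = struct.unpack_from("<i", data, pos)
            doc[key] = data[pos + 4:pos + 4 + length - 1].decode("utf-8")
            pos += 4 + length
        elif type_byte == 0x07:  # 12-byte ObjectId, kept here as a hex string
            doc[key] = data[pos:pos + 12].hex()
            pos += 12
        else:
            raise ValueError("unsupported BSON type 0x%02x" % type_byte)
    return doc


# Replace the four 0xee marker bytes with the record size (51 = 0x33).
repaired = struct.pack("<i", len(raw)) + raw[4:]
print(decode_document(repaired))
```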
I'm storing time series in MongoDB and the structure is as follows:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [563.424, 520.231, 529.658, 540.459, 544.271, 512.641, 579.591, 613.878, 627.708, 636.239, 672.883, 658.895, 646.44, 619.644, 623.543, 600.527, 619.431, 596.184, 604.073, 596.556, 590.898, 559.334, 568.09, 568.563],
"day" : 20110628,
}
The values array holds one value per hour, so position matters: position 0 is the first hour, position 1 the second hour, and so on.
Updating the value of a specific hour is quite easy. For example, to update the 7th hour of the day I do this:
db.timeseries.update({day : 20130203}, {$set : {"values.6" : 482.65}})
My problem is that I would like to use upsert, like this:
db.timeseries.update({day : 20130203}, {$set : {"values.6" : 482.65}}, {upsert : true})
But if the document does not exist, MongoDB will create an embedded document instead of an embedded array, like this:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : {"6" : 482.65},
"day" : 20130203,
}
There is a ticket to add a feature to solve this issue here, but meanwhile I have come up with a work-around to solve this in my case.
What I do is this: I first created a unique index on the day field. Then, whenever I want to upsert an hourly volume, I run these two commands:
db.timeseries.insert({day : 20130203, values : []}); // Will be rejected if it exists
db.timeseries.update({day : 20130203}, {$set : {"values.6" : 482.65}});
The first statement tries to create a new document, and thanks to the unique index the insert is rejected if the document already exists. If it does not exist, a document with an embedded array for the values field is created. This ensures that the update will work.
Result:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [null,null,null,null,null,null,482.65],
"day" : 20130203,
}
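The two outcomes described above can be reproduced with a small emulation of how a one-level dotted path such as "values.6" is applied. This is only a sketch of the behaviour reported in the question, not MongoDB's implementation, and `set_path` is a hypothetical helper.

```python
def set_path(doc, path, value):
    """Emulate {"$set": {path: value}} for a one-level dotted path."""
    field, key = path.split(".")
    if field not in doc:
        # No existing array to index into, so an embedded document is created.
        doc[field] = {key: value}
    elif isinstance(doc[field], list):
        index = int(key)
        array = doc[field]
        # Pad with None (null) up to the requested position.
        array.extend([None] * (index + 1 - len(array)))
        array[index] = value
    else:
        doc[field][key] = value
    return doc


# Upsert on a missing document: "values" becomes an embedded document.
print(set_path({"day": 20130203}, "values.6", 482.65))
# {'day': 20130203, 'values': {'6': 482.65}}

# After the insert with an empty array: "values" stays an array.
print(set_path({"day": 20130203, "values": []}, "values.6", 482.65))
# {'day': 20130203, 'values': [None, None, None, None, None, None, 482.65]}
```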
And here is my question:
In production, when several commands like this run simultaneously, can I be sure that my update command will be executed after my insert command? Note that I want to run both commands in unsafe mode, that is, I will not wait for any response from the server.
(It would also be interesting to hear comments about my workaround from a performance perspective.)
Generally yes, there is a way to ensure that two requests from a client use the same connection. By using the same connection you force a strict order of execution on the server.
The way to accomplish this differs between drivers.
For the Asynchronous Java Driver you can create a "Serialized" MongoClient from the initial MongoClient instance and it will ensure that all requests use a single connection.
The 10gen Java driver will automatically (via a ThreadLocal) try to use the same connection. You can also give the driver a hint via the DB.requestStart()/DB.requestDone() methods that a group of commands needs to be pipelined.
The start-request/end-request pattern applies to most of the 10gen drivers. As another example, the PyMongo driver's mongo_client has a start_request()/end_request() pair.
From a performance point of view, one database access is better than two. Can't you use $push instead of $set to update the values field?
I have two MongoDB collections
promo collection:
{
"_id" : ObjectId("5115bedc195dcf55d8740f1e"),
"curr" : "USD",
"desc" : "durable bags.",
"endDt" : "2012-08-29T16:04:34-04:00",
"origPrice" : 1050.99,
"qtTotal" : 50,
"qtClaimd" : 30,
}
claimed collection:
{
"_id" : ObjectId("5117c749195d62a666171968"),
"proId" : ObjectId("5115bedc195dcf55d8740f1e"),
"claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created in the "claimedPro" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should increment the counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple documents.
Also, the reason I went with the "non-embedded" approach is as follows:
A promo is created and published to users, and then claims arrive in the hundreds of thousands. I didn't think it was logical to embed claims inside the promo document, given the number of writes that would hit a single document (because MongoDB would have to resize the promo document as it grows with thousands of claims). The non-embedded approach leaves the promo collection unaffected and only inserts a new document into the "claims" collection. Later, when generating a report, I'll have to display the "promo" details along with the "claims" details for that promo. With the non-embedded approach I'll have to query the "promo" collection first and then the "claims" collection by "proId". *Also worth mentioning: there could be times when hundreds of "claims" happen simultaneously for the same "promo".*
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah and Salat, all with Scala 2.10.
db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO. I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the MongoDB way to increment.
Given a large (millions+) collection of documents similar to:
{ _id : ObjectId, "a" : 3, "b" : 5 }
What is the most efficient way to process these documents directly on the server, with the results added to each document within the same collection? For example, add a key c whose value equals a+b.
{ _id : ObjectId, "a" : 3, "b" : 5, "c" : 8 }
I'd prefer to do this in the shell.
Seems that find().forEach() would waste time in transit between the db and the shell, and mapReduce() seems intended to process groups of objects down into aggregated data (though I may be misunderstanding).
EDIT: I'd prefer a solution that doesn't block, if there is one (other than using a cursor on the client)...
From the MongoDB Docs on db.eval():
"db.eval() is used to evaluate a function (written in JavaScript) at the database server.
This is useful if you need to touch a lot of data lightly. In that scenario, network transfer of the data could be a bottleneck."
The documentation has an example of how to use it that is very similar to what you are trying to do.
forEach is your best option. I would run it on the server (from the shell) to reduce latency.
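The per-document work that such a forEach performs can be sketched as a generator of (filter, update) pairs. The sketch runs on plain dictionaries so it needs no server; `sum_updates` is a hypothetical helper, and with a driver each pair would be handed to an update call (for example PyMongo's `update_one(filter, update)`).

```python
def sum_updates(docs):
    """Yield a (filter, update) pair per document that sets c = a + b."""
    for doc in docs:
        yield {"_id": doc["_id"]}, {"$set": {"c": doc["a"] + doc["b"]}}


# Stand-in for a cursor over the collection.
docs = [{"_id": 1, "a": 3, "b": 5}, {"_id": 2, "a": 10, "b": -4}]
for flt, update in sum_updates(docs):
    print(flt, update)
```

Because the generator consumes one document at a time, it works the same way over a cursor of millions of documents without holding them all in memory.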