I have documents that I want to index into Elasticsearch with an existing unique "id" field.
I get an array of documents from a REST api endpoint ( eg.: http://some.url/api/products) in no particular order and if a document with the _id already exists in Elasticsearch it should update and reindex the document.
I want to create a new document if no document with the _id in Elasticsearch exists and then update a document, if it matches with an existing document in Elasticsearch.
This could be done with:
PUT products/product/un1qu3-1d-b718-105973677e95
{
"id": "un1qu3-1d-b718-105973677e95",
"state": "packaged"
}
The basic idea is to use the provided "id" field to create or update a document. Extraction of _id from document fields seems deprecated (link). But the indexing/ reindexing of documents with the "id" field can be done manually very easy with the kibana dev tools, with postman or a cURL request.
I want to achieve this (re-)indexing of documents that I receive over this api endpoint programmatically.
Is it possible to achieve this with logstash or a simple cronjob? Does Elasticsearch provide any functionality for this? Or do I need to write some custom backend to achieve this?
I thought of either:
1) index the document into Elasticsearch with the "id" field of my document or
2) find an Elasticsearch query that first searches for the document with the specific "id" field and then updates the document.
I was unable to find a solution for either way and have no clue how a good approach would look like.
Can anyone point me into the right direction on how to achieve this, suggest a better approach or provide a solution?
Any help much appreciated!
Update
I solved the problem with the help of the accepted answer. I used Logstash, the Http_poller input plugin, this article: https://www.elastic.co/blog/new-way-to-ingest-part-1 and this elastic.co question: https://discuss.elastic.co/t/upsert-with-logstash/59116
My output of logstash looks like this at the moment:
output {
elasticsearch {
index => "products"
document_type => "product"
pipeline => "rename_id"
document_id => "%{id}"
doc_as_upsert => true
action => "update"
}
Update 2
just for the sake of completeness I added the "rename_id" pipeline
{
"rename_id": {
"description": "_description",
"processors": [
{
"set": {
"field": "_id",
"value": "{{id}}"
}
}
]
}
}
It works this way!
Thanks alot!
Peter,
If I understand correctly, you want to ingest your documents into elastic search and will have some updates in future for these documents ?
If that's the case,
- Use your documents primary key as id for elastic documents.
- You can ingest entire document with updated values, elastic will replace the previous document with new one. given the primary key is same. Old document with same id will be deleted.
We use this approach for our search data.
you can use ingest pipelines to extract the id from the body and the _create endpoint to only create a document if it does not exist. Minor note: If you could specify the id on the client side indexing would be faster, as adding a pipeline adds a certain overhead.
PUT _ingest/pipeline/my_pipeline
{
"description": "_description",
"processors": [
{
"set": {
"field": "_id",
"value": "{{id}}"
}
}
]
}
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
"foo" : "bar",
"id" : "123"
}
GET twitter/tweet/123
# this call will fail
PUT twitter/tweet/1?op_type=create&pipeline=my_pipeline
{
"foo" : "bar",
"id" : "123"
}
You can use script to UPSERT (update or insert) your document
PUT /products/product/un1qu3-1d-b718-105973677e95/_update
{
"script": {
"inline": "ctx._source.state = \"packaged\"",
"lang": "painless"
},
"upsert": {
"id": "un1qu3-1d-b718-105973677e95",
"state": "packaged"
}
}
Above query find the document with _id = "un1qu3-1d-b718-105973677e95"
if it is able to find any document then it will update state to "packaged" otherwise create a new document with field "id" and "state" (you can insert as many fields as you want).
Related
{
questions: {
q1: "",
}
}
result after updating:
{
questions: {
q1: "",
q2: ""
}
}
I want to add q2 inside questions, without overwritting what's already inside it (q1).
One solution I found is to get the whole document, and modify it on my backend, and send the whole document to replace the current one. But it seems really in-effiecient as I have other fields in the document as well.
Is there a query that does it more efficiently? I looked at the mongodb docs, didn't seem to find a query that does it.
As turivishal said, you can use $set like this
But you also can use $addFields in this way:
db.collection.aggregate([
{
"$match": {
"questions.q1": "q1"
}
},
{
"$addFields": {
"questions.q2": "q2"
}
}
])
Example here
Also, reading your comment where you say Cannot create field 'q2' in element {questions: "q1"}. It seems your original schema is not the same you have into your DB.
Your schema says that questions is an object with field q1. But your error says that questions is the field and q1 the value.
Say I have an object with field state, I want to update this field, while keeping the previous value of state in previous_state field. First, I have tried to make an update with unset-rename-set:
collection.update(query, {$unset: {previous_state: ""}, $rename: {state: "previous_state"}, $set: {state: value}})
no surprise it did not work. After reading:
Update MongoDB field using value of another field
MongoDB update: Generate new field based on existing field, or update in place
Update field with another field's value in the document
I am nearly convinced that I do not have a solution to perform this in a single query. So the question is what is the best practice to do it?
There are various ways to do it, depending on the version of MongoDB, and they are described in this answer from another thread: https://stackoverflow.com/a/37280419/5538923 .
For MongoDB 3.4+, for example, there is this query that can be put in MongoSH:
db.collection.aggregate(
[
{ "$addFields": {
"previous_state": { "$concat": [ "$state" ] },
"state": { "$concat": [ "$state", " customly modified" ] }
}},
{ "$out": "collection" }
])
Also note that this query works only when the MongoDB instance is not sharded. When sharded (e.g., often the case in Microsoft Azure CosmosDB), the method described in that answer for MongoDB 3.2+ works, or alternatively put a new collection as destination (in the $out close), and then import the data in the original collection, after removing all the data there.
One solution (if you've got onlty one writer) could be to trigger your update in two steps:
> var previousvalue = collection.findOne(query).state;
> collection.update(query, {$set: {"previous_state": previousvalue, "state": newstatevalue}});
I have different types of data that would be difficult to model and scale with a relational database (e.g., a product type)
I'm interested in using Mongodb to solve this problem.
I am referencing the documentation at mongodb's website:
http://docs.mongodb.org/manual/tutorial/model-referenced-one-to-many-relationships-between-documents/
For the data type that I am storing, I need to also maintain a relational list of id's where this particular product is available (e.g., store location id's).
In their example regarding "one-to-many relationships with embedded documents", they have the following:
{
name: "O'Reilly Media",
founded: 1980,
location: "CA",
books: [12346789, 234567890, ...]
}
I am currently importing the data with a spreadsheet, and want to use a batchInsert.
To avoid duplicates, I assume that:
1) I need to do an ensure index on the ID, and ignore errors on the insert?
2) Do I then need to loop through all the ID's to insert a new related ID to the books?
Your question could possibly be defined a little better, but let's consider the case that you have rows in a spreadsheet or other source that are all de-normalized in some way. So in a JSON representation the rows would be something like this:
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 12346789
},
{
"publisher": "O'Reilly Media",
"founded": 1980,
"location": "CA",
"book": 234567890
}
So in order to get those sort of row results into the structure you wanted, one way to do this would be using the "upsert" functionality of the .update() method:
So assuming you have some way of looping the input values and they are identified with some structure then an analog to this would be something like:
books.forEach(function(book) {
db.publishers.update(
{
"name": book.publisher
},
{
"$setOnInsert": {
"founded": book.founded,
"location": book.location,
},
"$addToSet": { "books": book.book }
},
{ "upsert": true }
);
})
This essentially simplified the code so that MongoDB is doing all of the data collection work for you. So where the "name" of the publisher is considered to be unique, what the statement does is first search for a document in the collection that matches the query condition given, as the "name".
In the case where that document is not found, then a new document is inserted. So either the database or driver will take care of creating the new _id value for this document and your "condition" is also automatically inserted to the new document since it was an implied value that should exist.
The usage of the $setOnInsert operator is to say that those fields will only be set when a new document is created. The final part uses $addToSet in order to "push" the book values that have not already been found into the "books" array (or set).
The reason for the separation is for when a document is actually found to exist with the specified "publisher" name. In this case, all of the fields under the $setOnInsert will be ignored as they should already be in the document. So only the $addToSet operation is processed and sent to the server in order to add the new entry to the "books" array (set) and where it does not already exist.
So that would be simplified logic compared to aggregating the new records in code before sending a new insert operation. However it is not very "batch" like as you are still performing some operation to the server for each row.
This is fixed in MongoDB version 2.6 and above as there is now the ability to do "batch" updates. So with a similar analog:
var batch = [];
books.forEach(function(book) {
batch.push({
"q": { "name": book.publisher },
"u": {
"$setOnInsert": {
"founded": book.founded,
"location": book.location,
},
"$addToSet": { "books": book.book }
},
"upsert": true
});
if ( ( batch.length % 500 ) == 0 ) {
db.runCommand( "update", "updates": batch );
batch = [];
}
});
db.runCommand( "update", "updates": batch );
So what is doing in setting up all of the constructed update statements into a single call to the server with a sensible size of operations sent in the batch, in this case once every 500 items processed. The actual limit is the BSON document maximum of 16MB so this can be altered appropriate to your data.
If your MongoDB version is lower than 2.6 then you either use the first form or do something similar to the second form using the existing batch insert functionality. But if you choose to insert then you need to do all the pre-aggregation work within your code.
All of the methods are of course supported with the PHP driver, so it is just a matter of adapting this to your actual code and which course you want to take.
I'd like to make a specific subdocument value from a MondoDb document fixed, so it can not be possible to modify it at a next update, or any other MongoDb operations that can modify documents.
For example, if a document like the one bellow is inserted, I will like that "eyesColor" value can not be changed.
{
"id" : "someId",
"name": "Jane",
"eyesColor" : "blue"
}
A possible update can be:
{
"id" : "someId",
"name": "Amy",
"eyesColor" : "green"
}
And the result I need after this update is :
{
"id" : "someId",
"name": "Amy",
"eyesColor" : "blue"
}
I'd like to do this because the possibility of using $set and $unset operators is not present in the project I'm creating. A read on the existing document before the update, in order to get the value of the subdocument ("eyesColor") will decrease the performance of the application I work on.
Actually the constrain I need is similar to the fixed size on collections (capped collections). The difference is that it is on a subdocument instead of collection and on the value contained in the subdocument instead of the size.
Is there any solution to this type of constrain?
There are no constraints in MongoDB (only exception: unique indexes). There is no way to make fields "read-only" on the database-layer.
When you want to use upsert's (db.collection.update with upsert: true) which add certain fields on inserting new documents but don't affect these fields on updates of existing documents, you can place these fields behind the $setOnInsert-operator.
I want rename to rename my dict key in mongodb.
normally it works like that db.update({'_id':id},{$rename:{'oldfieldname':newfieldname}})
My document structure looks like that
{
'data':'.....',
'field':{'1':{'data':....},'2':{'data'...}},
'more_data':'....',
}
if i want to set
a new field in field 1 i do db.update({'_id':id},{$set:{'field.0.1.name':'peter'}})
for field two it is 'field'.1.2.name'
i thought with the rename it should be similar but it isn't ... (like $rename:{'field'.0.1': 2}
Here's a flexible method for renaming keys in a database
Given a document structure like this...
{
"_id": ObjectId("4ee5e9079b14f74ef14ddd2f"),
"code": "130.4",
"description": "4'' Socket Plug",
"technicalData": {
"Drawing No": "50",
"length": "200mm",
"diameter: "20mm"
},
}
I want to loop through all documents and rename technicalData["Drawing No"] to technicalData["Drawing Number"]
Run the following javascript in the execute panel in (the excellent) RockMongo
function remap(x){
dNo = x.technicalData["Drawing No"];
db.products.update({"_id":x._id}, {
$set: {"technicalData.Drawing Number" : dNo},
$unset: {"technicalData.Drawing No":1}
});
}
db.products.find({"technicalData.Drawing No":{$ne:null}}).forEach(remap);
The code will also run in a mongo shell
Your question is unclear but it seems you'd like to rename a field name within an array.
The short answer is you can't. As stated in the docs, $rename doesn't expand arrays to find a matching name. It only works on top level fields.
What you can do to simulate rename is by copying the field and its data to the new name, and then deleting the original field. You might also need a way to account for potentially concurrent writes if you have a lot of writes to that object/field.