Update specific field in Elasticsearch with Spark - Scala

Is there any way to update specific fields in Elasticsearch using Spark? I've been looking everywhere and can't find an answer to this. My script is written in Scala, and I can't seem to find any way to update individual fields other than issuing a query. As an example, I have this data:
"source" : {
"id" : 1
"name" : "John"
"location" : "Los Angeles, CA"
}
Now I want to just change the location to New York so it looks like this:
"source" : {
"id" : 1
"name" : "John"
"location" : "New York, NY"
}
Any help would be appreciated. Thanks

The only way I have used so far to update a specific record in Elasticsearch from Spark is by setting the option es.mapping.id, which lets you specify which field of the record is used as the ID to match against an existing record and to decide whether it is a new record or an update of an existing one. So what you need to do is get the ID of the existing record, apply your update, and then put that ID into the field named in es.mapping.id.
But I'm pretty sure there are other ways as well. I used this one because I needed saveJsonToEs, but saveToEsWithMeta should let you define the record ID more directly, without an additional field; try searching for examples.
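For illustration, here is a minimal Scala sketch using the elasticsearch-spark (elasticsearch-hadoop) connector. The index name people/_doc and the host are assumptions for the example; the important parts are es.mapping.id, which matches on the id field, and es.write.operation set to upsert, which sends a partial document so the other fields are left untouched.

import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._   // adds saveToEs to RDDs

val conf = new SparkConf()
  .setAppName("es-partial-update")
  .set("es.nodes", "localhost:9200")            // assumed cluster address
val sc = new SparkContext(conf)

// Partial document: only the ID field and the field we want to change.
val changes = sc.makeRDD(Seq(Map("id" -> 1, "location" -> "New York, NY")))

changes.saveToEs("people/_doc", Map(
  "es.mapping.id"      -> "id",      // use the "id" field as the document _id
  "es.write.operation" -> "upsert"   // partial update instead of replacing the whole document
))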

Related

How to update data in Elasticsearch, like a bulk update in MongoDB?

I'm trying to find a solution to update data in Elasticsearch with Go. There are about 1,000,000+ documents and each update must target a specific document ID. In MongoDB I can do this with a bulk operation, but I can't find the equivalent in Elasticsearch. Is there an operation like it, or does anyone have an idea how to update a huge amount of data in Elasticsearch by specific ID? Thanks in advance.
In general, you can use the bulk API to make such bulk updates. You can either index the data again using the same ID or just run an update. If you are doing it as a one-off, you can use curl to push the updates from the command line.
POST _bulk
{ "index" : { "_index" : "test", "_id" : "1" } }
{ "field1" : "value1" }
{ "update" : {"_id" : "1", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }
Another option is to use update_by_query if you are setting custom fields. You can also combine update_by_query with an ingest pipeline to update existing data.
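As a rough sketch of the update_by_query route from Scala, using the Java High Level REST Client (the index name, field names and host are placeholders for this example):

import org.apache.http.HttpHost
import org.elasticsearch.client.{RequestOptions, RestClient, RestHighLevelClient}
import org.elasticsearch.index.query.QueryBuilders
import org.elasticsearch.index.reindex.UpdateByQueryRequest
import org.elasticsearch.script.Script

val client = new RestHighLevelClient(
  RestClient.builder(new HttpHost("localhost", 9200, "http")))

// Update every document in "test" where field1 == "value1".
val request = new UpdateByQueryRequest("test")
request.setQuery(QueryBuilders.termQuery("field1", "value1"))
request.setScript(new Script("ctx._source.field2 = 'value2'"))   // inline painless script

val response = client.updateByQuery(request, RequestOptions.DEFAULT)
println(s"Updated ${response.getUpdated} documents")
client.close()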
It entirely comes down to whether you are trying to run the update using information from a different index (in which case you can use the enrich processor, available from 7.5 onwards) OR whether you simply want to add a new field and populate it using some rule based on attributes already present on the document.
So different options suit different scenarios: the bulk API is more appropriate when the data source is external, while update_by_query is appropriate when the data is already in Elasticsearch.
You can also look at reindexing with pipeline scripting, but again, the horses-for-courses rule applies here as well.

Using "id" in addition to "_id" in a Document Model

I've created a document model where I'm using a field named id in addition to MongoDB's auto-generated _id field.
Will this cause any problems for me down the line?
I can imagine a circumstance where something in Mongo assumes my "id" property refers to "_id" when it doesn't (say, some API feature with the good intention of saving you from typing that underscore, where in my case such a well-meaning feature would be a disaster).
Will this be ok?

Updating mongoData with MongoSpark

From the following tutorial provided by Mongo:
MongoSpark.save(centenarians.write.option("collection", "hundredClub").mode("overwrite"))
am I correct in understanding that what is essentially happening is that Mongo first drops the collection and then overwrites that collection with the new data?
My question, then, is whether it is possible to use the MongoSpark connector to actually update records in Mongo.
Let's say I've got data that looks like
{"_id" : ObjectId(12345), "name" : "John" , "Occupation" : "Baker"}
What I would then like to do is to merge the record of the person from another file that has more details, i.e. that file looks like
{"name" : "John", "address" : "1800 some street"}
the goal is to update the record in Mongo so now the JSON looks like
{"_id" : ObjectId(12345) "name" : "John" , "address" : 1800 some street", "Occupation" : "Baker"}
Now here's the thing: let's assume that we just want to update John, and that there are millions of other records that we would like to leave as is.
There are a few questions here; I'll try to break them down.
What is essentially happening is that Mongo first drops the collection and then overwrites it with the new data?
Correct, as of mongo-spark v2.x, if you specify mode overwrite, the MongoDB Connector for Spark will first drop the collection and then save the new result into it. See the source snippet for more information.
My question, then, is whether it is possible to use the MongoSpark connector to actually update records in Mongo.
The behaviour described in SPARK-66 (mongo-spark v1.1+) is that if a dataframe contains an _id field, the data will be upserted: any existing documents with the same _id value will be updated, and documents whose _id value does not already exist in the collection will be inserted.
What I would then like to do is to merge the record of the person from another file that has more details
As mentioned above, you need to know the _id value from your collection. Example steps:
Create a dataframe (A) by reading from your Person collection to retrieve John's _id value, i.e. ObjectId(12345).
Merge the _id value ObjectId(12345) into your dataframe (B, from the other file with more information). Utilise a unique field value to join the two dataframes (A and B).
Save the merged dataframe (C) without specifying overwrite mode.
we just want to update John, and that there are millions of other records that we would like to leave as is.
In that case, before you merge the two dataframes, filter out any unwanted records from dataframe B (the one from the other file with more details). In addition, when you call save(), specify mode append.
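Something along these lines in Scala (a sketch only; the database and collection names, the person_details.json file and the join key are assumptions for this example):

import com.mongodb.spark.MongoSpark
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("merge-person-details")
  .config("spark.mongodb.input.uri", "mongodb://localhost/test.people")
  .config("spark.mongodb.output.uri", "mongodb://localhost/test.people")
  .getOrCreate()
import spark.implicits._

// A: existing documents, including their _id
val existing = MongoSpark.load(spark)

// B: extra details from the other file, filtered down to the records we want to touch
val details = spark.read.json("person_details.json").filter($"name" === "John")

// C: join on a unique field so the _id is carried over, then save with append mode;
// because _id is present, the connector upserts instead of inserting duplicates
val merged = existing.join(details, Seq("name"))
MongoSpark.save(merged.write.option("collection", "people").mode("append"))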

In Mongo is it possible to find documents with the same values for multiple fields?

I've been working with aggregation and can get it to work with one field, but I cannot get it to work with the use case in the title.
I have a collection of DVDs. I need to run a query that tries to identify duplicate DVDs based on three fields.
Here's an example document:
DVD
name : "Fargo",
director : "Cohen Brothers",
genre : "crime"
I want to use an aggregation that groups documents whose three fields all match and returns them, but I have been unable to do so. Is it even possible?
An example output (based on similar functions), if there were 3 Fargos as above, would be:
[['_id':'Fargo', 'size':3],['_id':'12 Angry Men', 'size':1] ]
I found the answer straight after asking the question :/
This has already been answered on S/O here:
Mongodb Aggregation Framework | Group over multiple values?
You can group on a compound _id containing multiple fields, as follows:
$group : {
  _id : { name : '$name', director : '$director', genre : '$genre' },
  size : { $sum : 1 }
}
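The same idea from Scala, as a rough sketch with the MongoDB Scala driver (the database and collection names are made up for the example):

import org.mongodb.scala._
import org.mongodb.scala.model.Accumulators._
import org.mongodb.scala.model.Aggregates._
import org.mongodb.scala.model.Filters._
import scala.concurrent.Await
import scala.concurrent.duration._

val client = MongoClient("mongodb://localhost")
val dvds = client.getDatabase("store").getCollection("dvds")

// Group on the compound key and count, then keep only groups that occur more than once.
val pipeline = Seq(
  group(Document("name" -> "$name", "director" -> "$director", "genre" -> "$genre"),
        sum("size", 1)),
  filter(gt("size", 1))   // Aggregates.filter is the $match stage
)

val duplicates = Await.result(dvds.aggregate(pipeline).toFuture(), 10.seconds)
duplicates.foreach(println)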

Upsert an embedded array at specific position - will my work-around work in production?

I'm storing time series in MongoDB and the structure is as follows:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [563.424, 520.231, 529.658, 540.459, 544.271, 512.641, 579.591, 613.878, 627.708, 636.239, 672.883, 658.895, 646.44, 619.644, 623.543, 600.527, 619.431, 596.184, 604.073, 596.556, 590.898, 559.334, 568.09, 568.563],
"day" : 20110628,
}
The values array represents one value per hour, so the position is important: position 0 = first hour, 1 = second hour, and so on.
Updating the value of a specific hour is quite easy. For example, to update the 7th hour of the day I do this:
db.timeseries.update({day:20130203}, {$set : {"values.6" : 482.65}})
My problem is that I would like to use upsert, like this:
db.timeseries.update({day:20130203}, {$set : {"values.6" : 482.65}}, {upsert : true})
But if the document does not exist, MongoDB will create an embedded document instead of an embedded array, like this:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : {"6" : 482.65},
"day" : 20130203,
}
There is a ticket for a feature that would solve this issue, but meanwhile I have come up with a work-around for my case.
What I do is first create a unique index on the day field. Then, whenever I want to upsert an hourly value, I run these two commands:
db.timeseries.insert({day:20130203, values : []}); // Will be rejected if it exists
db.timeseries.update({day:20130203}, {$set : {"values.6" : 482.65}});
The first statement tries to create a new document, and thanks to the unique index the insert will be rejected if the document already exists. If it does not exist, a document with an embedded array in the values field is created, which ensures that the update will work.
Result:
{
"_id" : ObjectId("5128e567df6232180e00fa7d"),
"values" : [null,null,null,null,null,null,482.65],
"day" : 20130203,
}
And here is my question:
In production, when several commands like this are run simultaneously, can I be sure that my update command will be executed after my insert command? Note that I want to run both commands in unsafe mode, that is, I will not wait for any response from the server.
(It would also be interesting to hear comments about my work-around from a performance perspective.)
Generally yes, there is a way to ensure that two requests from a client use the same connection. By using the same connection you force a strict order of execution on the server.
The way to accomplish this differs between drivers.
For the Asynchronous Java Driver you can create a "Serialized" MongoClient from the initial MongoClient instance and it will ensure that all requests use a single connection.
For the 10gen Java driver it will automatically (via a ThreadLocal) try to use the same connection. You can also hint to the driver that a group of commands needs to be pipelined via the DB.requestStart()/DB.requestEnd() methods.
The requestStart()/requestEnd() pattern applies to most of the 10gen drivers. As another example, the PyMongo MongoClient has a start_request()/end_request() pair.
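For example, with the legacy 2.x Java driver used from Scala (a sketch only; the database and collection names are invented, and the writes run unacknowledged as in the question):

import com.mongodb.{BasicDBObject, MongoClient, WriteConcern}

val client = new MongoClient("localhost", 27017)
val db = client.getDB("metrics")
val coll = db.getCollection("timeseries")

db.requestStart()                      // pin this thread to one connection
try {
  // Rejected by the unique index on "day" if the document already exists;
  // with an unacknowledged write concern the failure is simply ignored.
  coll.insert(new BasicDBObject("day", 20130203).append("values", new java.util.ArrayList[Double]()),
              WriteConcern.UNACKNOWLEDGED)
  coll.update(new BasicDBObject("day", 20130203),
              new BasicDBObject("$set", new BasicDBObject("values.6", 482.65)),
              false, false, WriteConcern.UNACKNOWLEDGED)
} finally {
  db.requestEnd()                      // release the connection back to the pool
}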
From a performance point of view, it is better to use only one database access than two. Could you not use $push instead of $set to update the values field?