How can I write to MongoDB using Spark, considering the following scenarios:
If the document is present, update the matching fields with the newer values, and if a field is absent, add the new field. (With the replaceDocument parameter set to false, the matching fields are updated but new, unmatched fields are not added; with it set to true, my old fields can get lost.)
I want to keep one data field READ-ONLY. For example, there are two fields, first_load_date and updated_on: first_load_date should never change, since it is the day the record was created in MongoDB, while updated_on changes whenever new fields are added or older ones replaced.
If document is absent, insert.
The main problem is that replaceDocument = true leads to loss of older fields not present in the newer row, while false takes care of the matched fields but not the newer incoming ones.
I am using Mongo-Spark-Connector 2.4.1
df.write.format("mongo").mode("append").option("replaceDocument","true").option("database","db1").option("collection","my_collection").save()
I understand what you are trying to achieve here. You can use something like:
(df
.write
.format("mongo")
.mode("append")
.option("ordered", "false")
.option("replaceDocument", "false")
.option("database", "db1")
.option("collection", "my_collection")
.save()
)
Setting replaceDocument to false will preserve your old fields while updating the matched ones. You can get a BulkWriteException in the process, which setting the ordered parameter to false helps with.
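Note that neither replaceDocument setting can keep first_load_date read-only while still adding new fields. For that you would need per-document upserts combining $set and $setOnInsert, e.g. by writing from foreachPartition with pymongo's bulk API instead of the connector. A minimal sketch of building such an upsert spec (build_upsert and the field names are illustrative helpers, not part of the connector):

```python
from datetime import datetime, timezone

def build_upsert(row, key_field="_id"):
    """Build one MongoDB upsert spec: incoming fields are set or added,
    fields absent from the row are left untouched, and first_load_date
    is written only when the document is first inserted."""
    now = datetime.now(timezone.utc).isoformat()
    fields = {k: v for k, v in row.items() if k != key_field}
    return {
        "filter": {key_field: row[key_field]},
        "update": {
            "$set": {**fields, "updated_on": now},     # update/add incoming fields
            "$setOnInsert": {"first_load_date": now},  # read-only after creation
        },
        "upsert": True,
    }

op = build_upsert({"_id": 1, "name": "alice", "score": 10})
```

Each spec could then be turned into a pymongo UpdateOne and sent with bulk_write inside foreachPartition, so existing fields not present in the row survive and first_load_date is never overwritten.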
Related
I'm trying to query Elasticsearch with the elasticsearch-spark connector, and I want to return only a few results:
For example:
val conf = new SparkConf().set("es.nodes","localhost").set("es.index.auto.create", "true").setMaster("local")
val sparkContext = new SparkContext(conf)
val query = "{\"size\":1}"
println(sparkContext.esRDD("index_name/type", query).count())
However this will return all the documents in the index.
Some parameters in the query are actually ignored by design, such as from, size, fields, etc.
They are used internally by the elasticsearch-spark connector.
Unfortunately, this list of unsupported parameters isn't documented. But if you wish to use the size parameter, you can always rely on the pushdown predicate and use the DataFrame/Dataset limit method.
So you should use the Spark SQL DSL instead, e.g.:
val df = sqlContext.read.format("org.elasticsearch.spark.sql")
.option("pushdown","true")
.load("index_name/doc_type")
.limit(10) // instead of size : 10
This query will return the first 10 documents returned by the match_all query that is used by default by the connector.
Note: The following isn't correct on any level.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
When you query Elasticsearch, it also runs the query in parallel on all the index shards without overwriting them.
If I understand this correctly, you are executing a count operation, which does not return any documents. Do you expect it to return 1 because you specified size: 1? That's not happening, which is by design.
Edited to add:
This is the definition of count() in elasticsearch-hadoop:
override def count(): Long = {
  val repo = new RestRepository(esCfg)
  try {
    return repo.count(true)
  } finally {
    repo.close()
  }
}
It does not take the query into account at all, but considers the entire ES index as the RDD input.
This is actually on purpose. Since the connector does a parallel query, it also looks at the number of documents being returned so if the user specifies a parameter, it will overwrite it according to the es.scroll.limit setting (see the configuration option).
In other words, if you want to control the size, do so through that setting as it will always take precedence.
Beware that this parameter applies per shard. So, if you have 5 shards, you might get up to five hits if this parameter is set to 1.
See https://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html
In my application I have multiple records that contain username and domain.
Before, I used to keep all records when they had different values in the version field, but now I want to replace them all with just one version.
For example, I have device collection that is structured as:
{
    username: me,
    domain: stackoverflow.com,
    version: 1
}
{
    username: me,
    domain: stackoverflow.com,
    version: 2
}
And I kept upserting whenever there was a new version.
Now I would like to have only one record that replaces all existing documents. Whenever a new document with a new version is upserted, all records that match username and domain should be removed and merged into the new one.
I tried the upsert: true and multi: true options, but they do not delete the old records.
Any help would be great.
Upsert won't delete the old records. It can only replace or create new. (documentation for upsert)
You'll need to clean up the old data manually. There are a few options:
Wait until you encounter what previously would have been a new version of the document, and remove all old versions before saving the new one (clean up the old, then put down the new).
Use the aggregation framework to group on username and domain, returning a list of all combinations. Then, for each combination, eliminate all but the newest (you could sort on version to get the highest, then run a query using $ne to remove everything that matches except the highest version number). While this will hit your database hard, you'd only need to do it once.
Filter the data manually in your favorite programming language and move the data to a new collection. Again, slow, but you'd do it only once.
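A sketch of the aggregation option in Python (the pipeline and the stale_filter helper are illustrative; with pymongo you would run the pipeline via collection.aggregate and pass each resulting filter to a delete call):

```python
# One group per (username, domain), keeping track of the highest version seen.
pipeline = [
    {"$group": {
        "_id": {"username": "$username", "domain": "$domain"},
        "max_version": {"$max": "$version"},
    }},
]

def stale_filter(group):
    """Query matching every duplicate in a group except its newest version."""
    return {
        "username": group["_id"]["username"],
        "domain": group["_id"]["domain"],
        "version": {"$ne": group["max_version"]},  # everything but the highest
    }

f = stale_filter({"_id": {"username": "me", "domain": "stackoverflow.com"},
                  "max_version": 2})
```

Running the deletes group by group is slow but, as noted above, it is a one-time cleanup.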
If you don't care which one of the duplicates is kept, you can do this by creating a unique index over the two fields and specifying the dropDups: true option when calling ensureIndex, like this:
db.device.ensureIndex({username: 1, domain: 1}, {unique: true, dropDups: true})
This will force MongoDB to create the unique index by deleting documents with duplicate values leaving just one of each username/domain pairing (which seems to be just what you're looking for).
I need to create a document in mongodb and then immediately want it available in my application. The normal way to do this would be (in Python code):
doc_id = collection.insert({'name':'mike', 'email':'mike@gmail.com'})
doc = collection.find_one({'_id':doc_id})
There are two problems with this:
two requests to the server
not atomic
So, I tried using the find_and_modify operation to effectively do a "create and return" with the help of upserts like this:
doc = collection.find_and_modify(
    # so that no doc can be found
    query={'__no_field__': '__no_value__'},
    # If the <update> argument contains only field and value pairs,
    # and no $set or $unset, the method REPLACES the existing document
    # with the document in the <update> argument, except for the _id field
    update={'name': 'mike', 'email': 'mike@gmail.com'},
    # since the document does not exist, this will create it
    upsert=True,
    # this will return the updated (in our case, newly created) document
    new=True
)
This indeed works as expected. My question is: whether this is the right way to accomplish a "create and return" or is there any gotcha that I am missing?
What exactly are you missing from a plain old regular insert call?
If it is not knowing what the _id will be, you could just create the _id yourself first and insert the document. Then you know exactly what it will look like. None of the other fields will differ from what you sent to the database.
If you are worried about guarantees that the insert will have succeeded you can check the return code, and also set a write concern that provides enough assurances (such as that it has been flushed to disk or replicated to enough nodes).
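A sketch of that first suggestion (insert_and_return and FakeCollection are illustrative helpers, not pymongo API; pymongo itself generates the ObjectId client-side in the same spirit):

```python
import uuid

class FakeCollection:
    """Stand-in for a MongoDB collection, just to exercise the pattern."""
    def __init__(self):
        self.docs = []
    def insert(self, doc):
        self.docs.append(doc)
        return doc["_id"]

def insert_and_return(collection, fields):
    # Generate the _id client-side, so the complete document is known
    # locally before the single insert request is sent.
    doc = {"_id": uuid.uuid4().hex, **fields}
    collection.insert(doc)
    return doc

coll = FakeCollection()
doc = insert_and_return(coll, {"name": "mike", "email": "mike@gmail.com"})
```

This keeps the round trip to one request, and the document you hold locally is exactly what the server stores.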
I am trying to use upsert in MongoDB to update a single field in a document if it is found, OR insert a whole new document with lots of fields. The problem is that MongoDB appears to either replace every field or insert only a subset of fields in its upsert operation, i.e. it cannot insert more fields than it actually wants to update.
What I want to do is the following:
I query for a single unique value
If a document already exists, only a timestamp value (let's call it 'lastseen') is updated to a new value
If a document does not exist, I add it with a long list of different key/value pairs that should remain static for the remainder of its lifespan.
Let's illustrate:
This example would, from my understanding, update the 'lastseen' date if 'name' is found; but if 'name' is not found, it would only insert 'name' + 'lastseen'.
db.somecollection.update({name: "some name"},{ $set: {"lastseen": "2012-12-28"}}, {upsert:true})
If I added more fields (key/value pairs) to the second argument and dropped the $set, then every field would be replaced on update, but it would have the desired effect on insert. Is there anything like $insert or similar to perform operations only when inserting?
So it seems to me that I can only get one of the following:
The correct update behavior, but a document with only a subset of the desired fields inserted if the document does not exist
The correct insert behavior, but all existing fields overwritten if the document already exists
Is my understanding correct? If so, is it possible to solve this with a single operation?
MongoDB 2.4 has $setOnInsert
db.somecollection.update(
    {name: "some name"},
    {
        $set: {
            "lastseen": "2012-12-28"
        },
        $setOnInsert: {
            "firstseen": <TIMESTAMP> // set on insert, not on update
        }
    },
    {upsert: true}
)
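The same call from Python boils down to passing equivalent dictionaries (a sketch; the concrete firstseen value stands in for the <TIMESTAMP> placeholder above, and update_one with upsert=True is the pymongo 3+ spelling):

```python
# The filter and update documents, as they would be passed to pymongo.
filter_doc = {"name": "some name"}
update_doc = {
    "$set": {"lastseen": "2012-12-28"},           # applied on update and insert
    "$setOnInsert": {"firstseen": "2012-12-28"},  # applied only when the upsert inserts
}
# With a live connection this would be:
#   db.somecollection.update_one(filter_doc, update_doc, upsert=True)
```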
There is a feature request for this (https://jira.mongodb.org/browse/SERVER-340), which is resolved in 2.3. Odd releases are dev releases, so this will land in the 2.4 stable.
So there is no real way to do this in the current stable versions yet. I am afraid the only method at the moment is to do conditional queries: one to check for the document, then an if to either insert or update.
I suppose, if you had real problems with locking here, you could do this with server-side JS, but that's evil; it would, however, lock this update to a single thread.