This is a small piece of code I am using to do a simple search:
import com.sksamuel.elastic4s.{ElasticsearchClientUri, ElasticClient}
import com.sksamuel.elastic4s.ElasticDsl._
import org.elasticsearch.common.settings.ImmutableSettings

object Main3 extends App {
  val uri = ElasticsearchClientUri("elasticsearch://localhost:9300")
  val settings = ImmutableSettings.settingsBuilder().put("cluster.name", "elasticsearch").build()
  val client = ElasticClient.remote(settings, uri)

  if (client.exists("bands").await.isExists()) {
    println("Index already exists!")
    val num = readLine("Want to delete the index? ")
    if (num == "y") {
      client.execute { deleteIndex("bands") }.await
    } else {
      println("Leaving this here ...")
    }
  } else {
    println("Creating the index!")
    client.execute(create index "bands").await
    client.execute(index into "bands/artists" fields "name" -> "coldplay").await
    val resp = client.execute(search in "bands/artists" query "coldplay").await
    println(resp)
  }

  client.close()
}
This is the result that I get:
Connected to the target VM, address: '127.0.0.1:51872', transport: 'socket'
log4j:WARN No appenders could be found for logger (org.elasticsearch.plugins).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Creating the index!
{
"took" : 1,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"failed" : 0
},
"hits" : {
"total" : 0,
"max_score" : null,
"hits" : [ ]
}
}
Disconnected from the target VM, address: '127.0.0.1:51872', transport: 'socket'
Process finished with exit code 0
Creating the index and adding a document to it work fine, but a simple search query returns no results. I even checked this in Sense.
GET bands/artists/_search
{
"query": {
"match": {
"name": "coldplay"
}
}
}
gives
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.30685282,
"hits": [
{
"_index": "bands",
"_type": "artists",
"_id": "AU21OYO9w-qZq8hmdTOl",
"_score": 0.30685282,
"_source": {
"name": "coldplay"
}
}
]
}
}
How to solve this issue?
I suspect what is happening is that you are doing the search straight after the index operation in your code. However, in Elasticsearch documents are not available for search immediately; see the refresh interval setting. (So when you use the REST client, you are effectively waiting a few seconds by virtue of the fact that you have to manually flick between tabs, etc.)
You could test this quickly by putting a Thread.sleep(3000) after the index operation. If that confirms that it then works, you need to think about how you want to write your program.
Normally you just index, and when the data is available, then it's available. This is called eventual consistency. In the meantime (seconds) users might not have it available to search. That's usually not a problem.
If it IS a problem, then you will have to do some tricks like we do in the unit tests of elastic4s, where you keep 'count'ing until you get back the right number of documents.
Finally, you can also manually 'refresh' the index to speed things up, by calling
client.execute {
refresh index "indexname"
}
But that's usually only used when you turn off the automatic refreshing for bulk inserts.
Related
I am trying to set up syncing from MongoDB to Kudu with the Debezium MongoDB connector. But as the Debezium docs state, and as I also found by trying it myself, there is no filter (_id value) in the Debezium MongoDB CDC update/$set message.
{
"after": null,
"patch": "{\"$v\" : 1,\"$set\" : {\"_upts_ratio_average_points\" : {\"$numberLong\" : \"1564645156749\"},\"updatets\" : {\"$numberLong\" : \"1564645156749\"}}}",
"source": {
"version": "0.9.5.Final",
"connector": "mongodb",
"name": "promongodbdeb05",
"rs": "mgset-13056897",
"ns": "strtest.mg_jsd_result_all",
"sec": 1564645156,
"ord": 855,
"h": -1570214265415439167,
"initsync": false
},
"op": "u",
"ts_ms": 1564648181536
}
I don't understand why it was designed like this; without the filter there is really no way to know which document was updated. I downloaded the source code of this connector and tried to fix it. It looks like the class io.debezium.connector.mongodb.transforms.UnwrapFromMongoDbEnvelope is where the MongoDB oplog message is extracted, with code like the following. The file has somewhat confusing manipulations of both _id and id, and it looks like the committer of the connector did indeed try to include the _id value in the CDC update message. I tried changing valueDocument.append("id", keyDocument.get("id")); to valueDocument.append("id", keyDocument.get("_id"));, but there is still no _id value in the CDC message after the connector is rebuilt and deployed.
Can anyone familiar with Debezium help me with this?
{
private BsonDocument getUpdateDocument(R patchRecord, BsonDocument keyDocument) {
BsonDocument valueDocument = new BsonDocument();
BsonDocument document = BsonDocument.parse(patchRecord.value().toString());
if (document.containsKey("$set")) {
valueDocument = document.getDocument("$set");
}
if (document.containsKey("$unset")) {
Set<Entry<String, BsonValue>> unsetDocumentEntry = document.getDocument("$unset").entrySet();
for (Entry<String, BsonValue> valueEntry : unsetDocumentEntry) {
// In case unset of a key is false we don't have to do anything with it,
// if it's true we want to set the value to null
if (!valueEntry.getValue().asBoolean().getValue()) {
continue;
}
valueDocument.append(valueEntry.getKey(), new BsonNull());
}
}
if (!document.containsKey("$set") && !document.containsKey("$unset")) {
if (!document.containsKey("_id")) {
throw new ConnectException("Unable to process Mongo Operation, a '$set' or '$unset' is necessary " +
"for partial updates or '_id' is expected for full Document replaces.");
}
// In case of a full update we can use the whole Document as it is
// see https://docs.mongodb.com/manual/reference/method/db.collection.update/#replace-a-document-entirely
valueDocument = document;
valueDocument.remove("_id");
}
if (!valueDocument.containsKey("id")) {
valueDocument.append("id", keyDocument.get("id"));
}
if (flattenStruct) {
final BsonDocument newDocument = new BsonDocument();
valueDocument.forEach((fKey, fValue) -> newDocument.put(fKey.replace(".", delimiter), fValue));
valueDocument = newDocument;
}
return valueDocument;
}
}
#jiri, thanks a lot for your reply. The message I get always looks like this:
{ "after": null, "patch": "{\"$v\" : 1,\"$set\" : {\"_upts_ratio_average_points\" : {\"$numberLong\" : \"1564645156749\"},\"updatets\" : {\"$numberLong\" : \"1564645156749\"}}}", "source": { "version": "0.9.5.Final", "connector": "mongodb", "name": "promongodbdeb05", "rs": "mgset-13056897", "ns": "strtest.mg_jsd_result_all", "sec": 1564645156, "ord": 855, "h": -1570214265415439167, "initsync": false }, "op": "u", "ts_ms": 1564648181536 }
I searched and found that someone else is able to get Debezium MongoDB CDC messages with the _id, as in this article:
https://rmoff.net/2018/03/27/streaming-data-from-mongodb-into-kafka-with-kafka-connect-and-debezium/
like this:
{
"after": {\"_id\" : {\"$oid\" : \"58385328e4b001431e4e497a\"}, ....
As one can see, I can't get the _id, so there is no way for me to know which document/record a change applies to. But according to the above post, the author can get the _id, and from checking the code, the _id should be there. I used both 0.9.5.Final and the 0.7.4 revision used in the above post; no luck with either, always without the _id value.
When consuming the topic, you need to add --property print.key=true. The output will then include the message key, and the _id value is in the key part. Thanks.
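For example, a minimal console consumer invocation might look like the sketch below. The broker address is an assumption, and the topic name is only inferred from the connector name and namespace in the "source" block above (Debezium's <name>.<database>.<collection> convention), so adjust both to your setup.
# Assumed broker address and topic name; print.key=true makes the consumer show the
# message key, which is where Debezium puts the document's _id
kafka-console-consumer.sh --bootstrap-server localhost:9092 \
  --topic promongodbdeb05.strtest.mg_jsd_result_all \
  --from-beginning \
  --property print.key=true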
I'm trying to remove duplicate documents in MongoDB in a large collection according to the approach described here:
db.events.aggregate([
{ "$group": {
"_id": { "firstId": "$firstId", "secondId": "$secondId" },
"dups": { "$push": "$_id" },
"count": { "$sum": 1 }
}},
{ "$match": { "count": { "$gt": 1 } }}
], {allowDiskUse:true, cursor:{ batchSize:100 } }).forEach(function(doc) {
doc.dups.shift();
db.events.remove({ "_id": {"$in": doc.dups }});
});
I.e., I want to remove events that have the same "firstId - secondId" combination. However, after a while MongoDB responds with this error:
2016-11-30T14:13:57.403+0000 E QUERY [thread1] Error: getMore command failed: {
"ok" : 0,
"errmsg" : "BSONObj size: 17582686 (0x10C4A5E) is invalid. Size must be between 0 and 16793600(16MB)",
"code" : 10334
}
Is there any way to get around this? I'm using MongoDB 3.2.6.
The error message indicates that some part of the process is attempting to create a document that is larger than the 16 MB document size limit in MongoDB.
Without knowing your data set, I would guess that the size of the collection is sufficiently large that the number of unique firstId / secondId combinations is growing the result set past the document size limit.
If the size of the collection prevents finding all duplicates values in one operation, you may want to try breaking it up and iterating through the collection and querying to find duplicate values:
db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
cnt = db.events.find(
{ "firstId" : doc.firstId, "secondId" : doc.secondId },
{ "_id" : 0, "firstId" : 1, "secondId" : 1 } // explictly only selecting key fields to allow index to cover the query
).count()
if( cnt > 1 )
print('Dupe Keys: firstId: ' + doc.firstId + ', secondId: ' + doc.secondId)
})
It's probably not the most efficient implementation, but you get the idea.
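If you then want to actually remove the duplicates with this iterative approach rather than just print them, a rough sketch (keeping one document per combination and mirroring the doc.dups.shift() idea from the aggregation attempt; collection and field names are taken from the question) could be:
db.events.find({}, { "_id" : 0, "firstId" : 1, "secondId" : 1 }).forEach(function(doc) {
    // Collect the _id values of every document sharing this key combination
    var dups = db.events.find(
        { "firstId" : doc.firstId, "secondId" : doc.secondId },
        { "_id" : 1 }
    ).toArray().map(function(d) { return d._id; });
    if (dups.length > 1) {
        dups.shift();   // keep the first document found
        db.events.remove({ "_id" : { "$in" : dups } });
    }
})
As with the counting version, this is not the most efficient way to do it, but no single query result ever needs to hold every duplicate key at once, so it avoids the 16 MB problem.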
Note that this approach heavily relies upon the existence of the index { 'firstId' : 1, 'secondId' : 1 }
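If that compound index does not exist yet, you can create it first; a minimal sketch with no special index options assumed:
// Compound index so the lookups by firstId / secondId can be covered by the index
db.events.createIndex({ "firstId" : 1, "secondId" : 1 })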
I'm trying to delete an item inside an object that is categorized under multiple keys.
For example, deleting ObjectId("c") from every items section.
This is the structure:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : {
"/" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("b")
]
},
"/folder1" : {
"color" : "#112233",
"items" : [
ObjectId("c"),
ObjectId("d")
]
},
"/folder2" : {
"color" : "112233",
"items" : []
},
"/testing" : {
"color" : "112233",
"items" : [
ObjectId("c"),
ObjectId("f")
]
}
}
}
I tried with $pull and $unset like:
db.getCollection('col').update(
{},
{ $unset: { 'objects.$.items': ObjectId("c") } },
{ multi: true }
)
and
db.getCollection('col').update(
{},
{ "objects": {"items": { $pull: [ObjectId("c")] } } },
{ multi: true }
)
Any ideas? Thanks!
The problem here is largely with the current structure of your document. MongoDB cannot "traverse paths" in an efficient way, and your structure currently has an "Object" ( 'objects' ) which has named "keys". What this means is that accessing "items" within each "key" needs the explicit path to each key to be able to see that element. There are no wildcards here:
db.getCollection("coll").find({ "objects./.items": Object("c") })
And that is the basic principle to "match" something, as you cannot do it "across all keys" without resorting to JavaScript code, which is really bad.
Change the structure. Rather than "object keys", use "arrays" instead, like this:
{
"somefield" : "value",
"somefield2" : "value",
"objects" : [
{
"path": "/",
"color" : "#112233",
"items" : [
"c",
"b"
]
},
{
"path": "/folder1",
"color" : "#112233",
"items" : [
"c",
"d"
]
},
{
"path": "/folder2",
"color" : "112233",
"items" : []
},
{
"path": "/testing",
"color" : "112233",
"items" : [
"c",
"f"
]
}
]
}
It's much more flexible in the long run, and also allows you to "index" fields like "path" for use in query matching.
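For example, a minimal sketch (using the "coll" collection name from the examples here; use ensureIndex instead on shells older than 3.0) of indexing those fields would be:
// Multikey indexes on the embedded array fields used for matching
db.getCollection("coll").createIndex({ "objects.path" : 1 })
db.getCollection("coll").createIndex({ "objects.items" : 1 })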
However, it's not going to help you much here, as even with a consistent query path, i.e:
db.getCollection("coll").find({ "objects.items": Object("c") })
Which is better, but the problem still persists that it is not possible to $pull from multiple sources ( whether object or array ) in the same singular operation. And that is augmented with "never" across multiple documents.
So the best you will ever get here is basically "trying" the "multi-update" concept until the options are exhausted and there is nothing left to "update". With the "modified" structure presented, you can do this:
var bulk = db.getCollection("coll").initializeOrderedBulkOp(),
count = 0,
modified = 1;
while ( modified != 0 ) {
bulk.find({ "objects.items": "c"}).update({
"$pull": { "objects.$.items": "c" }
});
count++;
var result = bulk.execute();
bulk = db.getCollection("coll").initializeOrderedBulkOp();
modified = result.nModified;
}
print("iterated: " + count);
That uses the "Bulk" operations API ( actually all shell methods now use it anyway ) to basically get a "better write response" that gives you useful information about what actually happened on the "update" attempt.
The point is that it basically "loops" and tries to match a document based on the "query" portion of the update, and then tries to $pull from the matched array index an item from the "inner array" that matches the conditions given to $pull ( which acts as a "query" in itself, just upon the array items ).
On each iteration you basically get the "nModified" value from the response, and when this is finally 0, then the operation is complete.
On the ( restructured ) sample given, this will take 4 iterations, one for each "outer" array member. The updates are already "multi" as implied by bulk .update() ( as opposed to .updateOne() ), and therefore the "maximum" number of iterations is determined by the "maximum" number of array elements present in the "outer" array across the whole collection. So if there is "one" document out of "one thousand" that has 20 entries, then there will be 20 iterations, simply because that document still has something that can be matched and modified.
The alternate case under your current structure does not bear mentioning. It is just plain "impossible" without:
Retrieving the document individually
Extracting the present keys
Running an individual $pull for the array under that key
Get next document, rinse and repeat
So "multi" is "right out" as an option and cannot be done, without some some possible "foreknowledge" of the possible "keys" under the "object" key in the document.
So please "change your structure" and be aware of the general limitations available.
You cannot possibly do this in "one" update, but at least if the maximum "array entries" your document has was "4", then it is better to do "four" updates over a "thousand" documents than the "four thousand" that would be required otherwise.
Also. Please do not "obfuscate" the ObjectId value in posts. People like to "copy/paste" code and data to test for themselves. Using something like ObjectId("c") which is not a valid ObjectId value would clearly cause errors, and therefore is not practical for people to use.
Do what "I did" in the listing, and if you want to abstract/obfuscate, then do it with "plain values" just as I have shown.
One approach that you could take is using JavaScript native methods like reduce to create the documents that will be used in the update.
You essentially need an operation like the following:
var itemId = ObjectId("55ba3a983857192828978fec");
db.col.find().forEach(function(doc) {
var update = {
"object./.items": itemId,
"object./folder1.items": itemId,
"object./folder2.items": itemId,
"object./testing.items": itemId
};
db.col.update(
{ "_id": doc._id },
{
"$pull": update
}
);
})
Thus, creating the update object dynamically would require the reduce method, which converts the array of key names into an object:
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
Overall, you would need to use the Bulk operations to achieve the above update:
var bulk = db.col.initializeUnorderedBulkOp(),
itemId = ObjectId("55ba3a983857192828978fec"),
count = 0;
db.col.find().forEach(function(doc) {
var update = Object.getOwnPropertyNames(doc.objects).reduce(function(o, v, i) {
o["objects." + v + ".items"] = itemId;
return o;
}, {});
bulk.find({ "_id": doc._id }).updateOne({
"$pull": update
})
count++;
if (count % 1000 == 0) {
bulk.execute();
bulk = db.col.initializeUnorderedBulkOp();
}
})
if (count % 1000 != 0) { bulk.execute(); }
We have a collection that looks like this:
{
"_id" : "10571:6",
"v" : 261355,
"ts" : 4.88387e+008
}
Now, some of the "v" are ints, some are doubles. I want to change them all to doubles.
I've tried a few things but nothing works (v is an int32 for this record, I want to change it to a double):
db.getCollection('VehicleLastValues')
.find
(
{_id : "10572:6"}
)
.forEach
(
function (x)
{
temp = x.v * 1.0;
db.getCollection('VehicleLastValues').save(x);
})
Things I've tried:
x.v = x.v * 1.1 / 1.1;
x.v = parseFloat (new String(x.v));
But I can't get it to be saved as a double...
By default all "numbers" are stored as "double" in MongoDB unless generally cast overwise.
Take the following samples:
db.sample.insert({ "a": 1 })
db.sample.insert({ "a": NumberLong(1) })
db.sample.insert({ "a": NumberInt(1) })
db.sample.insert({ "a": 1.223 })
This yields a collection like this:
{ "_id" : ObjectId("559bb1b4a23c8a3da73e0f76"), "a" : 1 }
{ "_id" : ObjectId("559bb1bba23c8a3da73e0f77"), "a" : NumberLong(1) }
{ "_id" : ObjectId("559bb29aa23c8a3da73e0f79"), "a" : 1 }
{ "_id" : ObjectId("559bb30fa23c8a3da73e0f7a"), "a" : 1.223 }
Despite the different constructor functions, note how several of the data points there look much the same. The MongoDB shell itself doesn't always clearly distinguish between them, but there is a way you can tell.
There is of course the $type query operator, which allows selection of BSON Types.
So testing this with Type 1 - Which is "double":
> db.sample.find({ "a": { "$type": 1 } })
{ "_id" : ObjectId("559bb1b4a23c8a3da73e0f76"), "a" : 1 }
{ "_id" : ObjectId("559bb30fa23c8a3da73e0f7a"), "a" : 1.223 }
You see that both the first insert and the last are selected, but of course not the other two.
So now test for BSON Type 16 - which is a 32-bit integer
> db.sample.find({ "a": { "$type": 16 } })
{ "_id" : ObjectId("559bb29aa23c8a3da73e0f79"), "a" : 1 }
That was the "third" insertion which used the NumberInt() function in the shell. So that function and other serialization from your driver can set this specific BSON type.
And for the BSON Type 18 - which is 64-bit integer
> db.sample.find({ "a": { "$type": 18 } })
{ "_id" : ObjectId("559bb1bba23c8a3da73e0f77"), "a" : NumberLong(1) }
The "second" insertion which was contructed via NumberLong().
If you wanted to "weed out" things that were "not a double" then you would do:
db.sample.find({ "$or": [{ "a": { "$type": 16 } },{ "a": { "$type": 18 } }]})
Which are the only other valid numeric types other than "double" itself.
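As a side note, if you happen to be on MongoDB 3.2 or newer, $type also accepts an array of types, which expresses the same selection a little more compactly than the $or form above:
// Matches 32-bit and 64-bit integers in one condition (MongoDB 3.2+ only)
db.sample.find({ "a": { "$type": [ 16, 18 ] } })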
So to "convert" these in your collection, you can "Bulk" process like this:
var bulk = db.sample.initializeUnorderedBulkOp(),
count = 0;
db.sample.find({
"$or": [
{ "a": { "$type": 16 } },
{ "a": { "$type": 18 } }
]
}).forEach(function(doc) {
bulk.find({ "_id": doc._id })
.updateOne({
"$set": { "b": doc.a.valueOf() } ,
"$unset": { "a": 1 }
});
bulk.find({ "_id": doc._id })
.updateOne({ "$rename": { "b": "a" } });
count++;
if ( count % 1000 == 0 ) {
bulk.execute()
bulk = db.sample.initializeUnorderedBulkOp();
}
})
if ( count % 1000 != 0 ) bulk.execute();
What that does is perform three steps "in bulk":
Re-cast the value to a new field as a "double"
Remove the old field with the unwanted type
Rename the new field to the old field name
This is necessary since the BSON type information is "sticky" to the field element once created. So in order to "re-cast" you need to completely remove the old data which includes the original field assignment.
So that should explain how to "detect" and also "re-cast" unwanted types in your documents.
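Once that bulk run completes, you can verify the result with the same $type checks used earlier; the integer queries should match nothing and the "double" query should match every document:
// Should return 0 once everything has been re-cast
db.sample.find({ "$or": [{ "a": { "$type": 16 } }, { "a": { "$type": 18 } }] }).count()

// Should now count every document, since "a" is a double everywhere
db.sample.find({ "a": { "$type": 1 } }).count()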
I'm using MongoDB 2.6.1.
I have a 4-node solution across two data centres (2 DBs, 2 arbiters, but one arbiter is always out of the replica set):
{
"_id": "prelRS",
"members": [
{
"_id": 1,
"host": "serverInDataCenter1:27011",
"priority": 6
},
{
"_id": 3,
"host": "serverInDataCenter2:27013",
"priority": 0
},
{
"_id": 5,
"host": "serverInDataCenter1:27015",
"arbiterOnly": true
}
]
}
When we have a DR situation and need to use DataCenter2 only, we will reconfigure the set to contain just the DataCenter2 members.
When I try to take the primary out of the replica set and make the secondary the primary, it takes two attempts to force the configuration to apply, for what seem like transient state issues. The following was applied to the 27013 node, all done in the space of a few seconds.
prelRS:SECONDARY> cfg={
... "_id": "prelRS",
... "members": [
... {
... "_id": 3,
... "host": "serverInDataCenter2:27013",
... "priority": 4
... },
... {
... "_id": 6,
... "host": "serverInDataCenter2:27016",
... "arbiterOnly": true
... }
... ]
... }
{
"_id" : "prelRS",
"members" : [
{
"_id" : 3,
"host" : "serverInDataCenter2:27013",
"priority" : 4
},
{
"_id" : 6,
"host" : "serverInDataCenter2:27016",
"arbiterOnly" : true
}
]
}
prelRS:SECONDARY> rs.reconfig(cfg, {force : true})
{
"errmsg" : "exception: need most members up to reconfigure, not ok : serverInDataCenter2:27016",
"code" : 13144,
"ok" : 0
}
prelRS:SECONDARY> rs.reconfig(cfg, {force : true})
{ "ok" : 1 }
prelRS:SECONDARY>
this also seems to be the case when I am adding the 27011 node back in as well (as a lower priority replica) from the 27013 node
cfg={
"_id": "prelRS",
"members": [
{
"_id": 1,
"host": "serverInDataCenter1:27011",
"priority": 2
},
{
"_id": 3,
"host": "serverInDataCenter2:27013",
"priority": 4
},
{
"_id": 5,
"host": "serverInDataCenter1:27015",
"arbiterOnly": true
}
]
}
prelRS:PRIMARY> rs.reconfig(cfg)
{
"errmsg" : "exception: need most members up to reconfigure, not ok : serverInDataCenter1:27015",
"code" : 13144,
"ok" : 0
}
prelRS:PRIMARY> rs.reconfig(cfg)
2014-08-08T20:53:03.192+0100 DBClientCursor::init call() failed
2014-08-08T20:53:03.193+0100 trying reconnect to 127.0.0.1:27013 (127.0.0.1) failed
2014-08-08T20:53:03.193+0100 reconnect 127.0.0.1:27013 (127.0.0.1) ok
reconnected to server after rs command (which is normal)
But it doesn't seem to happen when I make this node primary again with the first config I mentioned (applied on the 27011 node).
That action isn't adding arbiters to the set, though, so maybe that's a clue as to what is going on?
I realize now that I probably need to leave 27011 in the replica set as priority 0 during DR, even though it is not available. But if all of DataCenter1 were unavailable, I would still have to add 27016 to the set and take 27015 out, and would face the error above when invoking DR.
Any suggestions why this takes two attempts to work?
Thanks
The problem is that your arbiters aren't "real" replica set members, so for some operations you're not going to be able to reach a quorum on the set.
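When you hit the "need most members up to reconfigure" error, one thing that can help is checking, from the node you are about to reconfigure, which members it currently sees as healthy before forcing the new configuration. A small diagnostic sketch (field names are those reported by rs.status()):
// Print this node's view of each member: name, state and health (1 = reachable)
rs.status().members.forEach(function(m) {
    print(m.name + "  state: " + m.stateStr + "  health: " + m.health);
});
If the arbiter you are adding (27016) shows up as unreachable at that moment, the first forced reconfig can fail exactly as in your output, and the retry succeeds once the node has established contact with it.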