How to save a DataFrame to Elasticsearch with mappings - Scala

I have the following code to save a DataFrame to Elasticsearch. It works well:
val conf = new SparkConf(true).set("spark.cassandra.connection.host", host)
conf.set("spark.es.index.auto.create", "true")
conf.set("spark.es.nodes", host)

val features = sqlContext.read.parquet(input)
features.write.format("org.elasticsearch.spark.sql")
  .mode(SaveMode.Append)
  .option("es.resource", "{ts}/log")
  .save()
It auto-creates the index when it does not exist. But when I try to query on some field, it shows the following error:
Set fielddata=true on [country] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
I am aware of the mapping needed to make text fields keywords:
{
  "your_field": {
    "type": "keyword",
    "index": true
  }
}
But I couldn't find how to apply these mappings when the index is created by this code.

In my experience, Elasticsearch for Hadoop also creates a .keyword sub-field with the keyword type for you.
Try querying country.keyword instead of country.
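If you want explicit control over the mapping instead of relying on the auto-generated .keyword sub-field, one option (a sketch, not taken from the answer above) is to register an index template before the Spark job runs, so every index that es.index.auto.create creates for {ts} picks up the mapping. The template name, pattern, and field layout below are illustrative, and the top-level key is "template" on Elasticsearch 5.x but "index_patterns" on 6.x+:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Illustrative template: maps country as keyword in every auto-created index.
val template =
  """{
    |  "template": "*",
    |  "mappings": {
    |    "log": {
    |      "properties": {
    |        "country": { "type": "keyword", "index": true }
    |      }
    |    }
    |  }
    |}""".stripMargin

val conn = new URL(s"http://$host:9200/_template/logs_template")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("PUT")
conn.setDoOutput(true)
conn.setRequestProperty("Content-Type", "application/json")
val out = conn.getOutputStream
out.write(template.getBytes(StandardCharsets.UTF_8))
out.close()
println(s"Template registration returned HTTP ${conn.getResponseCode}")

Once the template is in place, the Spark write from the question can stay exactly as it is.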


MongoDbIO Apache Beam GCP Dataflow with Mongo upsert pipeline example

I am looking for an example of an Apache Beam GCP Dataflow pipeline that updates data in MongoDB with an upsert operation, i.e. if the value exists it should be updated, and if not it should be inserted.
The syntax looks like this:
pipeline.apply(...)
    .apply(MongoDbIO.write()
        .withUri("mongodb://localhost:27017")
        .withDatabase("my-database")
        .withCollection("my-collection")
        .withUpdateConfiguration(UpdateConfiguration.create()
            .withUpdateKey("key1")
            .withUpdateFields(
                UpdateField.fieldUpdate("$set", "source-field1", "dest-field1"),
                UpdateField.fieldUpdate("$set", "source-field2", "dest-field2"),
                // pushes the entire input doc to the dest field
                UpdateField.fullUpdate("$push", "dest-field3"))));
Below is my pipeline code, where I am currently inserting documents after preparing the collection. The documents look like this:
{"_id":{"$oid":"619632693261e80017c44145"},"vin":"SATESTCAVA74621","timestamp":"2021-11-18T10:48:59.889Z","key":"EV_CHARGE_NOW_SETTING","value":"DEFAULT"}
Now I want to update 'value' and 'timestamp' if the combination of 'vin' and 'key' is present; if the 'vin' and 'key' combination is not present, the new document should be inserted, i.e. an upsert.
PCollection<PubsubMessage> pubsubMessagePCollection = pubsubMessagePCollectionMap.get(topic);
pubsubMessagePCollection
    .apply("Convert pubsub to kv, k=vin", ParDo.of(new ConvertPubsubToKVFn()))
    .apply("group by vin key", GroupByKey.<String, String>create())
    .apply("filter data for alerts, status and vehicle data", ParDo.of(new filterMessages()))
    .apply("converting message to document type", ParDo.of(
        new ConvertMessageToDocumentTypeFn(list_of_keys_str, collection,
            options.getMongoDBHostName(), options.getMongoDBDatabaseName()))
        .withSideInputs(list_of_keys_str))
    .apply(MongoDbIO.write()
        .withUri(options.getMongoDBHostName())
        .withDatabase(options.getMongoDBDatabaseName())
        .withCollection(collection));
Now, if I want to use the lines of code below:
.withUpdateConfiguration(UpdateConfiguration.create()
    .withUpdateKey("key1")
    .withUpdateFields(
        UpdateField.fieldUpdate("$set", "source-field1", "dest-field1"),
        UpdateField.fieldUpdate("$set", "source-field2", "dest-field2"),
        // pushes the entire input doc to the dest field
        UpdateField.fullUpdate("$push", "dest-field3"))));
What should my key1, "source-field1", "dest-field1", "source-field2", "dest-field2", and "dest-field3" be? I am confused by these values. Please help!
Below is the code I am currently trying to update with:
MongoDbIO.write()
    .withUri(options.getMongoDBHostName())
    .withDatabase(options.getMongoDBDatabaseName())
    .withCollection(collection)
    .withUpdateConfiguration(UpdateConfiguration.create()
        .withIsUpsert(true)
        .withUpdateKey("vin")
        .withUpdateKey("key")
        .withUpdateFields(
            UpdateField.fieldUpdate("$set", "vin", "vin"),
            UpdateField.fieldUpdate("$set", "key", "key"),
            UpdateField.fieldUpdate("$set", "timestamp", "timestamp"),
            UpdateField.fieldUpdate("$set", "value", "value")))
Using the above code, my document is not updated; instead a new one is added with _id = vin. It should update the existing record on a vin and key match, and when it does insert, it should insert with an auto-generated _id value. Please suggest what to do here.
The upsert flag is part of UpdateConfiguration; you can enable it with withIsUpsert(true). In your original syntax, add the extra line to enable upsert:
pipeline.apply(...)
    .apply(MongoDbIO.write()
        .withUri("mongodb://localhost:27017")
        .withDatabase("my-database")
        .withCollection("my-collection")
        .withUpdateConfiguration(
            UpdateConfiguration.create()
                .withIsUpsert(true)
                .withUpdateKey("key1")
                .withUpdateFields(
                    UpdateField.fieldUpdate("$set", "source-field1", "dest-field1"),
                    UpdateField.fieldUpdate("$set", "source-field2", "dest-field2"),
                    // pushes the entire input doc to the dest field
                    UpdateField.fullUpdate("$push", "dest-field3"))));
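To map the generic names onto the document in the question: key1 is the field of the incoming document used to look up the existing record, and each fieldUpdate("$set", source, dest) copies the source field of the incoming document into the dest field of the stored one. A sketch with the question's own field names follows; note that withUpdateKey takes a single field name, so whether two calls combine into a compound vin + key match should be verified against the MongoDbIO javadoc for your Beam version.
.withUpdateConfiguration(UpdateConfiguration.create()
    .withIsUpsert(true)
    // field used to match an existing document
    .withUpdateKey("vin")
    .withUpdateFields(
        // source and destination names are identical here because the
        // incoming and stored documents share the same schema
        UpdateField.fieldUpdate("$set", "key", "key"),
        UpdateField.fieldUpdate("$set", "timestamp", "timestamp"),
        UpdateField.fieldUpdate("$set", "value", "value")))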

Scala, Quill - how to compare values case-insensitively?

I created a Quill query which should find some data in the database by a given parameter:
val toFind = "SomeName"
val query = query.find(value => infix"$value = ${lift(toFind)}".as[Boolean])
It works fine when, for example, the database contains "SomeName", but if I pass "somename" I find nothing. The problem is case sensitivity.
Is it possible to always match values in a case-insensitive way? I have not found anything about this in the Quill docs.
OK, I found a solution. It is enough to add the LOWER() SQL function to the infix:
val query = query.find(value => infix"LOWER($value) = ${lift(toFind.toLowerCase)}".as[Boolean])
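For a self-contained picture of what this does, here is a minimal sketch using Quill's mirror context, which simply prints the SQL it would generate; the Person case class and its name column are hypothetical stand-ins for the real schema:

import io.getquill._

case class Person(name: String)

// a mirror context does not need a database; it just mirrors the generated SQL
val ctx = new SqlMirrorContext(PostgresDialect, Literal)
import ctx._

val toFind = "SomeName"
val q = quote {
  query[Person].filter(p =>
    infix"LOWER(${p.name}) = ${lift(toFind.toLowerCase)}".as[Boolean])
}

// both sides are lower-cased, so "SomeName", "somename" and "SOMENAME" all match
println(ctx.run(q).string)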

Using Elastic4s for Percolator Queries

I'm currently trying to create a percolator query with Elastic4s. I've got about this far, but I can't seem to find any examples, so I'm not sure quite how this works. So far I have:
val percQuery = percolate in esIndex / esType query myQuery
esClient.execute(percQuery)
Every time it runs, it doesn't match anything. I figured out I need to be able to percolate on an id, but I can't seem to find any examples of how to do that, not even in the docs. I know that with Elastic4s, requests other than a percolator query let you specify an id field, like:
val query = index into esIndex / esType source myDoc id 12345
I've tried the same for percolate, but it doesn't accept the id field. Does anyone know how this can be done?
I was previously using Dispatch HTTP to do this, but I'm trying to move away from it. Before, I was doing this to submit the percolator query:
url(s"$esUrl/.percolator/$queryId)
.setContentType("application/json", "utf-8")
.setBody(someJson)
.POST
Notice the queryId; I just need something similar to that, but in elastic4s.
So you want to add a document and return the queries that are waiting for that id to be added? That seems an odd use of percolate, as it will be one-time use only: only one document can be added per id. You can't currently percolate on id in elastic4s, and I'm not sure you can even do it in Elasticsearch itself.
This is the best attempt I can come up with, where you have your own "id" field, which could mirror the 'proper' _id field.
object Test extends App {

  import ElasticDsl._

  val client = ElasticClient.local

  client.execute {
    create index "perc" mappings {
      "idtest" as (
        "id" typed StringType
      )
    }
  }.await

  client.execute {
    register id "a" into "perc" query {
      termQuery("id", "a")
    }
  }.await

  client.execute {
    register id "b" into "perc" query {
      termQuery("id", "b")
    }
  }.await

  val resp1 = client.execute {
    percolate in "perc/idtest" doc("id" -> "a")
  }.await

  // prints a
  println(resp1.getMatches.head.getId)

  val resp2 = client.execute {
    percolate in "perc/idtest" doc("id" -> "b")
  }.await

  // prints b
  println(resp2.getMatches.head.getId)
}
Written using elastic4s 1.7.4
So after much more research, I figured out how this works with elastic4s. You actually have to use register instead of percolate, like so:
val percQuery = register id queryId into esIndex query myQuery
This will register a percolator query at the id.

Index hint with MongoDB C#

I am migrating from the MongoDB C# driver 1.10.0 to 2.0.0.
One of the collections I am using is very big and has to serve many queries with different filter attributes. That is why I was relying on index hint statements. With the v1.10 driver it looks like:
myCollection.Find(query).SetHint("myIndexName");
I searched the v2 driver API, but this hint method seems to have been completely removed. Is there an alternative? How should I specify index hints with the v2 driver?
Note: the solutions provided work for the latest MongoDB C# drivers as well.
You can use the FindOptions.Modifiers property.
var modifiers = new BsonDocument("$hint", "myIndexName");
await myCollection.Find(query, new FindOptions { Modifiers = modifiers }).ToListAsync();
May I ask why you are using the hint? Was the server consistently choosing the wrong index? You shouldn't need to do this except in exceptional cases.
Craig
Ideally, write the query in a way that lets the MongoDB optimizer use the index automatically.
If you are using FindAsync, there is a property named Hint. If you have an index named "myIndexName" that you want to force the query to use, do it like this:
BsonString bsonString = new BsonString("myIndexName");
cursor = await collection.FindAsync(y => y.Population > 400000000,
    new FindOptions<Person, Person>()
    {
        BatchSize = 200,
        NoCursorTimeout = true,
        AllowPartialResults = true,
        Projection = "{'_id':1,'Name':1,'Population':1}",
        Hint = bsonString.AsBsonValue,
    }).ConfigureAwait(false);
You can find the BsonString class in MongoDB.Bson.
With Aggregate you can force an index like this:
BsonString bsonString = new BsonString("ix_indice");
var query = this.collection.Aggregate(new AggregateOptions() { Hint = bsonString }).Match(new BsonDocument {..});
If you are using the LINQ IQueryable, you can specify the hint (and other options) like this (note the variable is declared as BsonValue so either form fits):
BsonValue hint = new BsonDocument("myFieldName", 1);
// or
BsonValue hint = new BsonString("myIndexName");

var results = await collection.AsQueryable(new AggregateOptions { Hint = hint })
    .ToListAsync();
myFieldName can also reference a nested field, e.g. Metadata.FileName.
myIndexName is the name of an index. For simple cases I prefer to reference the field directly (the first option) rather than an index.

OrientDB - How do I insert a document with connections to multiple other documents?

Using OrientDB 1.7-rc and Scala, I would like to insert a document (ODocument) into a document (not graph) database, with connections to other documents. How should I do this?
I've tried the following, but it seems to insert an embedded list of documents into the Package document, rather than connect the package to a set of Version documents (which is what I want):
val doc = new ODocument("Package")
.field("id", "MyPackage")
.field("versions", List(new ODocument("Version").field("id", "MyVersion")))
EDIT:
I've tried inserting a Package with connections to Versions through SQL, and that seems to produce the desired result:
insert into Package(id, versions) values ('MyPackage', [#10:3, #10:4] )
However, I need to be able to do this from Scala, which has yet to produce the correct results when loading the ODocument back. How can I do it (from Scala)?
You need to create the individual documents first, and then link them using SQL commands like the ones below.
Some examples from the OrientDB documentation:
insert into Profile (name, friends) values ('Luca', [#10:3, #10:4] )
OR
insert into Profile SET name = 'Luca', friends = [#10:3, #10:4]
See the OrientDB documentation for more details.
I tried posting this in the comments above, but the code was not readable there, so I am posting the response separately.
Here is an example of linking two documents in OrientDB, taken from the documentation. Here we add a new user to the DB and connect it to a given role:
var db = orient.getDatabase();
var role = db.query("select from ORole where name = ?", roleName);
if (role == null) {
  response.send(404, "Role not found", "text/plain", "Error: role name not found");
} else {
  db.begin();
  try {
    // "@class" tells OrientDB which class the saved document belongs to
    var result = db.save({ "@class": "OUser", name: "Gaurav", password: "gauravpwd", roles: role });
    db.commit();
    return result;
  } catch (err) {
    db.rollback();
    response.send(500, "Error: Server", "text/plain", err.toString());
  }
}
Hope it helps you and others.
This is how to insert a Package with a linkset referring to an arbitrary number of Versions:
val version = new ODocument("Version")
  .field("id", "1.0")
version.save()

val versions = new java.util.HashSet[ODocument]()
versions.add(version)

// `package` is a reserved word in Scala, so use a different name for the val
val pkg = new ODocument("Package")
  .field("id", "MyPackage")
  .field("versions", versions)
pkg.save()
When inserting a Java Set into an ODocument field, OrientDB understands this to mean you want a linkset, which is an unordered, unique collection of references.
When reading the Package back out of the database, you should get hold of its Versions like this:
// requires: import scala.collection.JavaConverters._
val versions = doc.field[java.util.HashSet[ODocument]]("versions").asScala.toSeq
Just as a HashSet was used when the linkset of versions was saved, a HashSet should be used when loading the referenced ODocument instances.
Optionally, to enforce that Package.versions is in fact a linkset of Versions, you may encode this in the database schema (in SQL):
create property Package.versions linkset Version
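To verify the links from the console, a query along these lines (the id value is illustrative) should expand the linked Version documents rather than show embedded copies:
select expand(versions) from Package where id = 'MyPackage'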