Elasticsearch completion suggester, custom ordering - autocomplete

Is there a way to achieve custom ordering in the Elasticsearch feature "Completion suggester" without providing a separate field containing a "weight" attribute during indexing?
My goal is to sort the suggestions by string length (or similar relevance), so that short words are scored higher than longer words.
Here is a mapping example of a field I want to provide an autocomplete function on:
...
skills: {
    type: "nested",
    include_in_parent: true,
    properties: {
        name: {
            type: "multi_field",
            fields: {
                name: {type: "string"},
                original: {type: "string", analyzer: "string_lowercase", include_in_all: false},
                suggest: {type: "completion", index_analyzer: "simple", search_analyzer: "simple"}
            }
        }
    }
}
...
Here I provide several different fields: the "original" field is used for filtering in queries, and the "suggest" field is used by the completion suggester.
My data when indexing could look like this:
...
"skills": [
{
"name": "C",
"source": [
"linkedin"
]
},
{
"name": "Computer Science",
"source": [
"stackoverflow"
]
},
]
...
What I'm trying to avoid is having to provide an extra field when indexing and adding the weighting myself, like this:
...
"skills_suggest": [
{
"input": ["C"],
"output": "C",
"weight" : 100
},
{
"input": ["Computer Science"],
"output": "Computer Science",
"weight" : 50
}
]
...
I'm trying to do this because I use an autocomplete function on a search field in my application. I show the 10 highest hits in a dropdown below the search field. My problem is that with so many skills to choose from, when I type the letter "C", the actual skill that is the programming language "C" is scored lower than longer words and therefore does not show up in my top 10.
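For reference, a suggest request to populate such a dropdown might look roughly like this (a sketch against the legacy _suggest API; the index name my_index is an assumption, and the field path follows the mapping above):

POST /my_index/_suggest
{
    "skill_suggest": {
        "text": "C",
        "completion": {
            "field": "skills.name.suggest",
            "size": 10
        }
    }
}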

Related

Update a nested field with an unknown index and without affecting other entries

I have a collection with a layout that looks something like this:
student1 = {
    "First_Name": "John",
    "Last_Name": "Doe",
    "Courses": [
        {
            "Course_Id": 123,
            "Course_Name": "Computer Science",
            "Has_Chosen_Modules": false
        },
        {
            "Course_Id": 284,
            "Course_Name": "Mathematics",
            "Has_Chosen_Modules": false
        }
    ]
};
I also have the following update query:
db.Collection_Student.update(
    {
        $and: [
            {First_Name: "John"},
            {Last_Name: "Doe"}
        ]
    },
    {
        $set: { "Courses.0.Has_Chosen_Modules": true }
    }
);
This code currently updates the Computer Science Has_Chosen_Modules value to true, since the index is hardcoded. However, what if I wanted to update the value of Has_Chosen_Modules via the Course_Id instead (as the course might not necessarily be at the same index every time)? How would I achieve this without affecting the other courses that a given student is taking?
You can select any item in a sub-array of your document by targeting any property within that sub-array using dot notation, and then update the matched element with the positional operator $.
You can achieve this with the following query:
db.Collection_Student.update(
    {
        First_Name: "John",
        Last_Name: "Doe",
        'Courses.Course_Id': 123
    },
    {
        $set: { "Courses.$.Has_Chosen_Modules": true }
    }
);
Conditions in the query filter are combined with an implicit $and, so you don't need to write $and explicitly for a simple query like this.
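As an aside, if several matching courses ever needed updating in one statement, a sketch using arrayFilters (available from MongoDB 3.6) could look like this, reusing the values from the example:

db.Collection_Student.updateOne(
    { First_Name: "John", Last_Name: "Doe" },
    { $set: { "Courses.$[course].Has_Chosen_Modules": true } },
    // updates every array element whose Course_Id matches the arrayFilters condition
    { arrayFilters: [ { "course.Course_Id": 123 } ] }
);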

Search values using Index in mongodb

I am new to MongoDB and wish to implement search on a field in a Mongo collection.
I have the following structure for my test collection:
{
    'key': <unique key>,
    'val_arr': [
        ['laptop', 'macbook pro', '16gb', 'i9', 'spacegrey'],
        ['cellphone', 'iPhone', '4gb', 't2', 'rose gold'],
        ['laptop', 'macbook air', '8gb', 'i5', 'black'],
        ['router', 'huawei', '10x10', 'white'],
        ['laptop', 'macbook', '8gb', 'i5', 'silver']
    ]
}
And I wish to find entries based on index number and value, i.e. find the entry where the first element in any of the val_arr rows is 'laptop' and the third element's value is '8gb'.
I tried looking at compound indexes in MongoDB, but they have a limit of 32 fields. Any help in this direction is appreciated.
There is a limit on indexes here, but it really should not matter. In your case you actually say 'key': <unique key>. So if that really is "unique" then it's the only thing in the collection that needs to be indexed, as long as you actually include that "key" as part of every query you make, since that alone is enough to select a document.
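For instance, a minimal sketch of such an index:

db.collection.createIndex({ "key": 1 }, { unique: true })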
Indexes on arrays "within" a document really don't matter that much unless you actually intend to search directly for those elements within a document. That might be the case, but this actually has no bearing on matching your values by numbered index positions:
db.collection.find(
    {
        "val_arr": {
            "$elemMatch": { "0": "laptop", "2": "8gb" }
        }
    },
    { "val_arr.$": 1 }
)
Which would return:
{
    "val_arr" : [
        [
            "laptop",
            "macbook air",
            "8gb",
            "i5",
            "black"
        ]
    ]
}
The $elemMatch allows you to express "multiple conditions" on the same array element. This is needed over standard dot notation forms because otherwise the condition is simply looking for "any" array member which matches the value at the index. For instance:
db.collection.find({ "val_arr.0": "laptop", "val_arr.2": "4gb" })
Actually matches the given document even though that "combination" does not exist in a single "row"; both values are present in the array as a whole, just in different members. Using those same values with $elemMatch makes sure the pair is matched within the same element.
Note the { "val_arr.$": 1 } in the above example, which is the projection for the "single" matched element. That is optional, but this is just to talk about identifying the matches.
Using .find() this is as much as you can do, and it is a limitation of the positional operator in that it can only identify one matching element. The way to do this for "multiple matches" is to use .aggregate() with $filter:
db.collection.aggregate([
    { "$match": {
        "val_arr": {
            "$elemMatch": { "0": "laptop", "2": "8gb" }
        }
    }},
    { "$addFields": {
        "val_arr": {
            "$filter": {
                "input": "$val_arr",
                "cond": {
                    "$and": [
                        { "$eq": [ { "$arrayElemAt": [ "$$this", 0 ] }, "laptop" ] },
                        { "$eq": [ { "$arrayElemAt": [ "$$this", 2 ] }, "8gb" ] }
                    ]
                }
            }
        }
    }}
])
Which returns:
{
    "key" : "k",
    "val_arr" : [
        [
            "laptop",
            "macbook air",
            "8gb",
            "i5",
            "black"
        ],
        [
            "laptop",
            "macbook",
            "8gb",
            "i5",
            "silver"
        ]
    ]
}
The initial query conditions which actually select the matching document go into the $match and are exactly the same as the query conditions shown earlier. The $filter is applied to keep just the elements which actually match its conditions. Those conditions use $arrayElemAt inside the logical expression in much the same way as the index values of "0" and "2" are applied in the query conditions themselves.
Using any aggregation expression incurs an additional cost over the standard query engine capabilities. So it is always best to consider whether you really need it before you dive in and use the statement. Regular query expressions are always better as long as they do the job.
Changing Structure
Of course, whilst it's possible to match on index positions of an array, none of this actually helps in being able to create an "index" which can be used to speed up queries.
The best course here is to actually use meaningful property names instead of plain arrays:
{
    'key': "k",
    'val_arr': [
        {
            'type': 'laptop',
            'name': 'macbook pro',
            'memory': '16gb',
            'processor': 'i9',
            'color': 'spacegrey'
        },
        {
            'type': 'cellphone',
            'name': 'iPhone',
            'memory': '4gb',
            'processor': 't2',
            'color': 'rose gold'
        },
        {
            'type': 'laptop',
            'name': 'macbook air',
            'memory': '8gb',
            'processor': 'i5',
            'color': 'black'
        },
        {
            'type': 'router',
            'name': 'huawei',
            'size': '10x10',
            'color': 'white'
        },
        {
            'type': 'laptop',
            'name': 'macbook',
            'memory': '8gb',
            'processor': 'i5',
            'color': 'silver'
        }
    ]
}
This does allow you "within reason" to include the paths to property names within the array as part of a compound index. For example:
db.collection.createIndex({ "val_arr.type": 1, "val_arr.memory": 1 })
And then actually issuing queries looks far more descriptive in the code than cryptic values of 0 and 2:
db.collection.aggregate([
    { "$match": {
        "val_arr": {
            "$elemMatch": { "type": "laptop", "memory": "8gb" }
        }
    }},
    { "$addFields": {
        "val_arr": {
            "$filter": {
                "input": "$val_arr",
                "cond": {
                    "$and": [
                        { "$eq": [ "$$this.type", "laptop" ] },
                        { "$eq": [ "$$this.memory", "8gb" ] }
                    ]
                }
            }
        }
    }}
])
This returns the expected results, and they are more meaningful:
{
    "key" : "k",
    "val_arr" : [
        {
            "type" : "laptop",
            "name" : "macbook air",
            "memory" : "8gb",
            "processor" : "i5",
            "color" : "black"
        },
        {
            "type" : "laptop",
            "name" : "macbook",
            "memory" : "8gb",
            "processor" : "i5",
            "color" : "silver"
        }
    ]
}
The common reason most people arrive at a structure like the one in your question is typically because they think they are saving space. This is simply not true, and with most modern optimizations to the storage engines MongoDB uses it is basically irrelevant next to any small gains that might have been anticipated.
Therefore, for the sake of "clarity" and also in order to actually support indexing on the data within your "arrays", you really should change the structure and use named properties here instead.
And again, if your entire usage pattern of this data is not using the key property of the document in queries, then it probably would be better to store those entries as separate documents to begin with instead of being in an array at all. That also makes getting results more efficient.
So to break that all down, your options here really are:
1. You actually always include key as part of your query, so indexes anywhere else but on that property do not matter.
2. You change to using named properties for the values on the array members, allowing you to index on those properties without hitting "Multikey limitations".
3. You decide you never access this data using the key anyway, so you just write all the array data as separate documents in the collection with proper named properties.
Going with one of those that actually suits your needs best is essentially the solution allowing you to efficiently deal with the sort of data you have.
N.B. Nothing to do with the topic at hand really (except maybe a note on storage size), but it would generally be recommended that things with an inherent numeric value, such as the "8gb" memory figures, actually be expressed as numbers rather than strings.
The simple reasoning is that whilst you can query for "8gb" as an equality, this does not help you with ranges such as "between 4 and 12 gigabytes".
Therefore it usually makes much more sense to use numeric values like 8 or even 8000. Note that numeric values typically take less space to store than strings. Given that omitting property names did nothing to reduce storage, this does show an actual area where storage size can be reduced.
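To illustrate, a sketch of such a range query, assuming memory is stored as a plain number of gigabytes:

// matches laptops with between 4 and 12 gigabytes of memory
db.collection.find({
    "val_arr": {
        "$elemMatch": { "type": "laptop", "memory": { "$gte": 4, "$lte": 12 } }
    }
})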

Put properties with different name in one field in MongoDB

I am getting requests from different devices as JSON. Some of them report temperature as "T", some others as "temp", and it can be different again on other devices. Is it possible in MongoDB to put all of these values into a single field "temperature"?
It doesn't matter if it is "temp" or "T" or "tempC"; just put all of them in the "temperature" field.
Here is an example of my data:
[
    { "ip": "12:3B:6A:1A:E6:8B", "type": 0, "t": 37 },
    { "ip": "22:33:66:1A:E6:8B", "type": 1, "temperature": 40 },
    { "ip": "1A:3C:6A:1A:E6:8B", "type": 1, "temp": 30 }
]
I want to put temp, t and temperature into a single Temperature field in my collection.
You can use the $ifNull operator to control which value is transferred into your output, like below:
db.col.aggregate([
    {
        $addFields: { Temperature: { $ifNull: [ { $ifNull: [ "$t", "$temperature" ] }, "$temp" ] } }
    },
    {
        $project: {
            t: 0,
            temperature: 0,
            temp: 0
        }
    }
])
This will merge those three fields into one Temperature field, taking the first non-null value. Additionally, if you want to update your collection, you can add $out as the last aggregation stage, like { $out: "col" }, but keep in mind that it will entirely replace your source collection.
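As an aside, on MongoDB 4.2 or newer the $merge stage can be used instead; it writes results back over matching documents rather than replacing the whole collection. A sketch of such a final stage:

// write each aggregated document back over its source document, matched by _id
{ $merge: { into: "col", on: "_id", whenMatched: "replace", whenNotMatched: "discard" } }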
MongoDB supports regular expressions, but they are meant for searching data, not for inserting it based on field-name matches.
I am quite sure you will need some kind of facade in front of your database to achieve that.
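A minimal sketch of such a facade in application JavaScript (the function name normalizeReading and the exact field set are assumptions based on the example data):

// map any of the device-specific field names onto a single "temperature" field
function normalizeReading(doc) {
    var t = doc.temperature !== undefined ? doc.temperature
          : doc.temp !== undefined ? doc.temp
          : doc.t;
    return { ip: doc.ip, type: doc.type, temperature: t };
}

// normalizeReading({ ip: "1A:3C:6A:1A:E6:8B", type: 1, temp: 30 })
// => { ip: "1A:3C:6A:1A:E6:8B", type: 1, temperature: 30 }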

mongodb check regex on fields from one collection to all fields in other collection

After digging through Google and SO for a week I've ended up asking the question here. Suppose there are two collections:
UsersCollection:
[
    {...
        name: "James",
        userregex: "a|regex|str|here"
    },
    {...
        name: "James",
        userregex: "another|regex|string|there"
    },
    ...
]
PostCollection:
[
    {...
        title: "a string here ..."
    },
    {...
        title: "another string here ..."
    },
    ...
]
I need to get all users whose userregex will match any post.title (I need user_id, post_id pairs or something similar).
What I've tried so far:
1. Get all users in the collection and run each regex on all posts. This works, but it's too dirty! It has to execute a query for each user.
2. Same as above, but using a forEach in the Mongo query; this is the same as above, only in the database layer instead of the application layer.
I searched a lot for available methods such as aggregation, $unwind, etc., with no luck.
So is it possible to do this in Mongo? Should I change my database? If yes, what would be a good choice? Performance is my first priority. Thanks.
It is not possible to reference a regex stored in a document from the $regex operator inside a $match expression, so this can't be done on the Mongo side with the current structure.
$lookup works well with equality conditions. So one alternative (similar to what Nic suggested) would be to update your posts collection to include an extra field called keywords (an array of keyword values it can be searched on) for each title:
db.users.aggregate([
    {$lookup: {
        from: "posts",
        localField: "userregex",
        foreignField: "keywords",
        as: "posts"
    }}
])
The above query will do something like this (works from 3.4):
keywords: { $in: [ userregex.elem1, userregex.elem2, ... ] }
From the docs:
If the field holds an array, then the $in operator selects the documents whose field holds an array that contains at least one element that matches a value in the specified array.
It looks like earlier versions (tested on 3.2) will only match if the arrays have the same order, values, and length.
Sample Input:
Users
db.users.insertMany([
    {
        "name": "James",
        "userregex": [
            "another",
            "here"
        ]
    },
    {
        "name": "John",
        "userregex": [
            "another",
            "string"
        ]
    }
])
Posts
db.posts.insertMany([
    {
        "title": "a string here",
        "keywords": [
            "here"
        ]
    },
    {
        "title": "another string here",
        "keywords": [
            "another",
            "here"
        ]
    },
    {
        "title": "one string here",
        "keywords": [
            "string"
        ]
    }
])
Sample Output:
[
    {
        "name": "James",
        "userregex": [
            "another",
            "here"
        ],
        "posts": [
            {
                "title": "another string here",
                "keywords": [
                    "another",
                    "here"
                ]
            },
            {
                "title": "a string here",
                "keywords": [
                    "here"
                ]
            }
        ]
    },
    {
        "name": "John",
        "userregex": [
            "another",
            "string"
        ],
        "posts": [
            {
                "title": "another string here",
                "keywords": [
                    "another",
                    "here"
                ]
            },
            {
                "title": "one string here",
                "keywords": [
                    "string"
                ]
            }
        ]
    }
]
MongoDB is good for your use case, but you need to use an approach different from your current one. Since you are only concerned about any title matching any post, you can store the last result of such a match. Below is example code:
db.users.find({last_post_id: {$exists: 0}}).forEach(
    function(row) {
        var regex = new RegExp(row['userregex']);
        var found = db.post_collection.findOne({title: regex});
        if (found) {
            var post_id = found["post_id"];
            db.users.updateOne(
                { user_id: row["user_id"] },
                { $set: { last_post_id: post_id } }
            );
        }
    }
)
What it does is filter only the users which don't have last_post_id set, search the post records for each of them, and set last_post_id if a record is found. So after running this, you can return the results like:
db.users.find({last_post_id: {$exists: 1}}, {user_id:1, last_post_id:1, _id:0})
The only thing you need to be concerned about is an edit/delete to an existing post. After every edit/delete you should run the snippet below, so that all matches for that post id are computed again:
var post_id_changed = 1;
db.users.updateMany({last_post_id: post_id_changed}, {$unset: {last_post_id: 1}})
This will make sure that the next time you run the update these users are processed again. The approach does have one drawback: for every user without a matching title, the query will run again and again, though you can work around that with timestamps or a post-count check.
Also, you should make sure to put an index on post_collection.title.
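For instance (note that a regex query can only use such an index efficiently when the pattern is anchored to a prefix, like /^something/):

db.post_collection.createIndex({ title: 1 })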
I was thinking that if you pre-tokenized your post titles like this:
{
    "_id": ...
    "title": "Another string there",
    "keywords": [
        "another",
        "string",
        "there"
    ]
}
but unfortunately $lookup requires that foreignField is a single element, so my idea of something like this will not work :( But maybe it will give you another idea?
db.Post.aggregate([
    {$lookup: {
        from: "Users",
        localField: "keywords",
        foreignField: "keywords",
        as: "users"
    }}
])

spring-data-mongodb document design options - array vs dynamic field (i.e. map)

I am designing a document structure and using spring-data-mongodb to access it. The document structure stores a device profile. Each device contains modules of different types, and a device can contain multiple modules of the same type. Module types are dynamic, as new types of modules are created from time to time.
Please note: I try not to write custom queries, to avoid boilerplate code, but some custom queries are fine.
I have come up with two designs.
The first one uses a dynamic field (i.e. a map). The semantics are better, but it seems harder to query/update using spring-data-mongodb:
{
    deviceId: "12345",
    instanceTypeMap: {
        "type1": {
            moduleMap: {
                "1": {field1: "value", field2: "value"},
                "2": {field1: "value", field2: "value"}
            }
        },
        "type2": {
            moduleMap: {
                "30": {fielda: "value", fieldb: "value"},
                "45": {fielda: "value", fieldb: "value"}
            }
        }
    }
}
The second one uses arrays, and query/update seems more in line with spring-data-mongodb:
{
    deviceId: "12345",
    allInstances: [
        {
            type: 1,
            modules: [
                {
                    id: 1,
                    field1: "value",
                    field2: "value"
                },
                {
                    id: 2,
                    field1: "value",
                    field2: "value"
                }
            ]
        },
        {
            type: 2,
            modules: [
                {
                    id: 30,
                    fielda: "value",
                    fieldb: "value"
                },
                {
                    id: 45,
                    fielda: "value",
                    fieldb: "value"
                }
            ]
        }
    ]
}
I am inclined to use arrays. Is it better to use arrays instead of dynamic fields with spring-data-mongodb? I searched online and found people mentioning that querying by key (i.e. in a map) is not as easy in spring-data-mongodb. Is that a correct statement? Am I missing anything? Thank you in advance.
I ended up with the design below, using one device-instance-type per document, because in some scenarios updates are done on many modules of the same instance type, and those updates can then be aggregated into a single database update.
The redundant "moduleId" field is also added for query purposes.
{
    deviceId: "12345",
    instanceTypeId: "type1",
    moduleMap: {
        "1": {
            moduleId: "1",
            field1: "value",
            field2: "value"
        },
        "2": {
            moduleId: "2",
            field1: "value",
            field2: "value"
        }
    }
}
Now I can use spring-data-mongodb's derived queries:
findByDeviceId("12345");
findByDeviceIdAndInstanceTypeId("12345","type1");
findByDeviceIdAndInstanceTypeIdAndModuleMapModuleId("12345","type1","1");
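For completeness, the "many modules, one update" benefit mentioned above might look like this in the shell (the collection name devices and the new values are assumptions):

// one round trip updates several modules of the same instance type
db.devices.updateOne(
    { deviceId: "12345", instanceTypeId: "type1" },
    { $set: {
        "moduleMap.1.field1": "newValue",
        "moduleMap.2.field1": "newValue"
    } }
)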