MongoDB multilingual search: which schema is better for faster search results, nested or language-specific fields directly?

We are implementing fuzzy search on products using an Atlas Search index, and we are querying with Mongoose. The search we want needs to be multilingual, and for this we are using the following schema for a product:
{
  language: "de",
  name: String,
  description: String,
  translation: {
    en: {
      name: String,
      description: String
    },
    fr: {
      name: String,
      description: String
    }
  }
}
Will the above schema be a good fit considering search performance, as there will be thousands or more reads? Going forward, search queries may go up to millions, as this is an e-commerce system. Will a nested structure be good for querying, or are there other options we can opt for?
Option 1 - having language-specific fields directly, with the language shorthand as a suffix:
{
  name_de: String,
  description_de: String,
  name_en: String,
  description_en: String,
  name_fr: String,
  description_fr: String
}
Option 2 - having language-specific fields nested, with the field name as the key:
{
  name: {
    en: String,
    de: String,
    fr: String
  },
  description: {
    en: String,
    de: String,
    fr: String
  }
}
Option 3 - having the language as the key, with the field names nested in that object:
{
  en: {
    name: String,
    description: String
  },
  fr: {
    name: String,
    description: String
  }
}
Or is there any other schema that would be suitable for this scenario?
Search will be performed based on the language selected by the user. So, if a user opts for French as their preferred language, we will look for the keyword typed by the user in French.
P.S. - There are more language-specific fields than just name and description.

I would opt for option 1 because of the limited support for nested fields in Atlas Search, though option 2 would work as well. Here is how I would define the index in your case:
{
  "mappings": {
    "fields": {
      "name_de": {
        "analyzer": "lucene.german",
        "type": "string"
      },
      "name_fr": {
        "analyzer": "lucene.french",
        "type": "string"
      },
      "name_en": {
        "analyzer": "lucene.english",
        "type": "string"
      },
      "description_de": {
        "analyzer": "lucene.german",
        "type": "string"
      },
      "description_fr": {
        "analyzer": "lucene.french",
        "type": "string"
      },
      "description_en": {
        "analyzer": "lucene.english",
        "type": "string"
      }
    }
  }
}
This way, you get the benefits of highlighting, which could be extra helpful if your description field is long. You will also get better stop word handling and diacritics support out of the box. If you have any trouble, let me know here and I will help.
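For the query side, here is a minimal sketch of what a language-aware fuzzy search could look like with Mongoose. The index name (default), the Product model, and the fuzzy/highlight options are my assumptions, not something from your post:

const language = "fr"; // taken from the user's preference
const userInput = "chocolat"; // hypothetical search term

// Fuzzy-search the fields for the selected language and return
// highlights for the description field.
const results = await Product.aggregate([
  {
    $search: {
      index: "default", // assumed index name
      text: {
        query: userInput,
        path: [`name_${language}`, `description_${language}`],
        fuzzy: { maxEdits: 1 }
      },
      highlight: { path: `description_${language}` }
    }
  },
  { $limit: 20 },
  {
    $project: {
      [`name_${language}`]: 1,
      highlights: { $meta: "searchHighlights" }
    }
  }
]);

Because the language only changes the field suffix, the same pipeline serves all languages.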

Related

Specifying several languages for a MongoDB text index without a language field

I have a collection with documents in the following format:
{
  _id: 1,
  name: {
    ru: "Name in Russian",
    en: "Name in English"
  },
  description: {
    ru: "Description in Russian",
    en: "Description in English"
  }
}
I want to create a text index for the collection where the fields name.ru and description.ru would use Russian, and the fields name.en and description.en would use English as the default language.
I've read documentation suggesting that subdocuments should use the language field to define a language other than the index default, but that doesn't work in my case.
What can be done in my situation?
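For context, the documented override mentioned above relies on each embedded document naming its own language in a language field (or in the field configured via the language_override index option), which would require restructuring the documents. A sketch of that shape:

// Documented pattern (not the shape above): per-subdocument language.
{
  _id: 1,
  translations: [
    { language: "russian", name: "Name in Russian" },
    { language: "english", name: "Name in English" }
  ]
}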

Some documents do not appear in Atlas Search when querying by a few letters

I have a collection. The document structure is:
{
  model: {
    name: 'string name'
  }
}
I have enabled Atlas Search and created a search index for the model.name field. Search works fine; the only issue is that I can't get results for very short query strings.
Example:
I have a document,
{
  model: {
    name: "space1duplicate"
  }
}
If I query space, I don't get the result:
{
  index: 'search_index',
  compound: {
    must: [
      {
        text: {
          query: 'space',
          path: 'model.name'
        }
      }
    ]
  }
}
But if I query space1duplica, it returns the result.
During indexing, the full-text search engine tokenizes the input by splitting the text up into searchable chunks. Check out the relevant section in the documentation.
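To illustrate what that means for this document (a sketch of the assumed analyzer behavior, not actual Atlas output):

// Assumed tokenization of "space1duplicate":
//   lucene.standard (default): ["space1duplicate"]    -> one token
//   regexSplit on "[0-9]+":    ["space", "duplicate"] -> two tokens
// A plain text query for "space" only matches documents that produced
// a "space" token at index time.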
By default, Atlas Search does not split words on digits, but if you need that, try defining a custom analyzer with the regex tokenizer and using it for your field:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": [
        {
          "analyzer": "digitSplitter",
          "type": "string"
        }
      ]
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "digitSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[0-9]+",
        "type": "regexSplit"
      }
    }
  ]
}
Also note that you can use multiple analyzers for string fields, if needed.
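A sketch of one way to do that, assuming the standard multi-analyzer field mapping (the sub-field name digits is my own invention):

"name": {
  "type": "string",
  "analyzer": "lucene.standard",
  "multi": {
    "digits": {
      "type": "string",
      "analyzer": "digitSplitter"
    }
  }
}

A query can then target the alternate analyzer with path: { value: "name", multi: "digits" }.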
Atlas Search uses Lucene to do the job. The documentation on the MongoDB site is mostly focused on the Mongo-specific syntax for passing the query to Lucene, and it might be a bit confusing if you are not familiar with Lucene's query language.
First of all, there are a number of tokenizers and analyzers available, each serving a specific purpose. You really need to include the index definition when you ask questions about Atlas Search.
The default tokenizer uses word separators to build the index, then removes endings to store stems, depending on the language, English by default.
So, in order to find "space1duplicate" by the beginning of the word, you can use the "autocomplete" field type with nGram tokenization. The index should be created as follows:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "tokenization": "nGram",
        "type": "autocomplete"
      }
    }
  },
  "storedSource": {
    "include": [
      "name"
    ]
  }
}
Once it's indexed (you may need to wait a bit if you have a larger dataset), you can find the document with the following search:
{
  index: 'search_index',
  compound: {
    must: [
      {
        autocomplete: {
          query: 'spa',
          path: 'name'
        }
      }
    ]
  }
}

What is the indexing strategy for a variable query?

The most common use case for this would probably be a user table, with name, lname, email, phone.
I might search for name contains "paul", email contains "2@yahoo".
I might search for phone = 01234567890.
I might search for email = "foo@bar.com".
It is my understanding that a MongoDB compound index works in order, so an index that looks like
{ name: 1, lname: 1, email: 1, phone: 1 } wouldn't work for any of the above queries?
What's the best indexing strategy to account for search tables like this?
So, Paul, you will need to create an index definition before you can run the query. Creating your first search index definition in the collection view in the Atlas Data Explorer can be tricky.
Here's what I would recommend for an index definition based on those docs:
{
  "mappings": {
    "fields": {
      "email": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "phone": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "name": {
        "analyzer": "lucene.keyword",
        "type": "string"
      },
      "lname": {
        "analyzer": "lucene.keyword",
        "type": "string"
      }
    }
  }
}
Here is what I would recommend for a contains-style query on the email and name fields:
{
  $search: {
    index: 'default',
    compound: {
      must: [{
        wildcard: {
          query: '*paul*',
          path: 'name'
        }
      }, {
        wildcard: {
          query: '*2@yahoo*',
          path: 'email'
        }
      }]
    }
  }
}
This should be a lightning-fast query, even for a large index, and the compound operator above already combines multiple clauses the way you described. There are also features like highlighting that should be helpful. Let me know if you have any more trouble.

JSON and object inheritance

We are trying to move from SOAP to REST and we stumbled across this issue.
Here, party can be of type Individual or Organization.
Sample XMLs
<customer>
  <details>
    <party xsi:type="Individual">
      <externalID>ABC123</externalID>
      <firstname>John</firstname>
      <lastname>Smith</lastname>
    </party>
  </details>
</customer>

<customer>
  <details>
    <party xsi:type="Organization">
      <externalID>APPLE</externalID>
      <organizationName>Apple Inc</organizationName>
      <listingName>APPLE</listingName>
    </party>
  </details>
</customer>
However, when we move to the JSON representation of the same data, we encounter the problem that the inheritance information is lost.
JSON Sample
{
  "customer": {
    "details": {
      "party": {
        "externalID": "ABC123",
        "firstname": "John",
        "lastname": "Smith"
      }
    }
  }
}

{
  "customer": {
    "details": {
      "party": {
        "externalID": "APPLE",
        "organizationName": "Apple Inc",
        "listingName": "APPLE"
      }
    }
  }
}
So when we convert the JSON back to Java objects using libraries like Gson, we lose the information about whether the party is an Individual or an Organization.
While one workaround is to build additional services that retrieve the "details" returning the concrete types (Individual or Organization), is there any other approach to handling this in JSON?
It has been possible to combine schemas using keywords such as oneOf, allOf, and anyOf, and to get the payload validated, since JSON Schema v1.0.
https://spacetelescope.github.io/understanding-json-schema/reference/combining.html
However, composition has been enhanced by the discriminator keyword, incorporated into OpenAPI (formerly Swagger) to provide basic polymorphism support. In OpenAPI 3.0, this support was improved further by the addition of the oneOf keyword.
Your inheritance could be modeled using a combination of oneOf (for choosing one of the children) and allOf (for combining parent and child).
paths:
  /customers:
    post:
      requestBody:
        content:
          application/json:
            schema:
              oneOf:
                - $ref: '#/components/schemas/Individual'
                - $ref: '#/components/schemas/Organization'
              discriminator:
                propertyName: customer_type
      responses:
        '201':
          description: Created
components:
  schemas:
    Customer:
      type: object
      required:
        - customer_type
        - externalID
      properties:
        customer_type:
          type: string
        externalID:
          type: string
      discriminator:
        propertyName: customer_type
    Individual:
      allOf:
        - $ref: '#/components/schemas/Customer'
        - type: object
          properties:
            firstName:
              type: string
            lastName:
              type: string
    Organization:
      allOf:
        - $ref: '#/components/schemas/Customer'
        - type: object
          properties:
            organizationName:
              type: string
            listingName:
              type: string
https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.0.md#schemaComposition
Edit (18/02/2020): for generating Java code using Jackson, OpenAPITools has seemingly been fixed by PR #5120.
You can write a custom Gson deserializer that creates an instance of either the Organization or the Individual class based on the existence of the organizationName field. Something like what is specified here: Gson - deserialization to specific object type based on field value; instead of "type", you can check for the existence of your property.
I found a resource submitted by IBM and felt it may be of some help. It's linked below.
On pg. 27 they convert:
<year xsi:type="xs:positiveInteger">1989</year>
To:
{ "xsi:type" : "xs:positiveInteger", value : 1989 }
Thus, I think your JSON should convert to:
{
  "customer": {
    "details": {
      "party": {
        "xsi:type": "Individual",
        "externalID": "ABC123",
        "firstname": "John",
        "lastname": "Smith"
      }
    }
  }
}

{
  "customer": {
    "details": {
      "party": {
        "xsi:type": "Organization",
        "externalID": "APPLE",
        "organizationName": "Apple Inc",
        "listingName": "APPLE"
      }
    }
  }
}
Resource for reference: https://www.w3.org/2011/10/integration-workshop/s/ExperienceswithJSONandXMLTransformations.v08.pdf

Mongo find documents where the value of a property does not contain a given string

This Meteor server code needs to find all documents where food does not contain 'hot', case-insensitive.
FoodCol.find({food: /^hot/}); is not cutting it.
I need the code to only return {food: 'chicken soup', type: 'soups'}, since it is the only document where the string 'hot' is not found in the food property.
How can it be done? Thanks. The sample documents:
[
  {
    food: 'Hot coffee',
    type: 'drink'
  }, {
    food: 'cake with hot topping',
    type: 'cake'
  }, {
    food: 'chicken soup',
    type: 'soups'
  }
]
Run the following query. It uses the $not operator, which performs a logical NOT operation on the specified regex and selects the documents that do not match it:
FoodCol.find({ "food": { "$not": /hot/i } })