How to search in Elasticsearch for the most common word of a single field in a single document?

Let's say I have a document that has a field "pdf_content" of type keyword containing:
"good polite nice good polite good"
I would like it to return:
{
  "word": "good",
  "occurrences": 3
},
{
  "word": "polite",
  "occurrences": 2
},
{
  "word": "nice",
  "occurrences": 1
}
How is this possible using Elasticsearch 7.15?
I tried this in the Kibana console:
GET /pdf/_search
{
  "aggs": {
    "pdf_contents": {
      "terms": { "field": "pdf_content" }
    }
  }
}
But it only returns the list of PDFs I have indexed.

Have you ever tried term_vector?
Basically, you can do:
Mappings:
{
  "mappings": {
    "properties": {
      "pdf_content": {
        "type": "text",
        "term_vector": "with_positions_offsets_payloads"
      }
    }
  }
}
with your sample document:
POST /pdf/_doc/1
{
  "pdf_content": "good polite nice good polite good"
}
Then you can do:
GET /pdf/_termvectors/1
{
  "fields": ["pdf_content"],
  "offsets": false,
  "payloads": false,
  "positions": false,
  "term_statistics": false,
  "field_statistics": false
}
If you want to see other information, you can set these to true. Setting them all to false gives you exactly what you want.
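For the sample document above, the abridged response contains exactly the counts you asked for in the term_freq values (a sketch of the 7.x response shape; metadata fields such as took and _version are omitted):
{
  "_index": "pdf",
  "_id": "1",
  "found": true,
  "term_vectors": {
    "pdf_content": {
      "terms": {
        "good": { "term_freq": 3 },
        "nice": { "term_freq": 1 },
        "polite": { "term_freq": 2 }
      }
    }
  }
}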

Related

Some documents do not appear in Atlas Search when querying by a few letters

I have a collection. The document structure is:
{
  model: {
    name: 'string name'
  }
}
I have enabled Atlas Search and created a search index for the model.name field. Search works fine, but the only issue is that I can't get results for very short query strings.
Example:
I have a document:
{
  model: {
    name: "space1duplicate"
  }
}
If I query space, I don't get the result:
{
  index: 'search_index',
  compound: {
    must: [
      {
        text: {
          query: 'space',
          path: 'model.name'
        }
      }
    ]
  }
}
But if I query space1duplica, it returns the result.
During indexing, a full-text search engine tokenizes the input by splitting the text into searchable chunks. Check out the relevant section in the documentation.
By default, Atlas Search does not split words by digits, but if you need that, try defining a custom analyzer with the regex tokenizer and using it for your field:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": [
        {
          "analyzer": "digitSplitter",
          "type": "string"
        }
      ]
    }
  },
  "analyzers": [
    {
      "charFilters": [],
      "name": "digitSplitter",
      "tokenFilters": [],
      "tokenizer": {
        "pattern": "[0-9]+",
        "type": "regexSplit"
      }
    }
  ]
}
Also note that you can use multiple analyzers for string fields, if needed.
Atlas Search uses Lucene to do the job. The documentation on the MongoDB site is mostly focused on the Mongo-specific syntax for passing the query to Lucene and might be a bit confusing if you are not familiar with its query language.
First of all, there are a number of tokenizers and analyzers available, each serving a specific purpose. You really need to include the index definition when you ask questions about Atlas Search.
The default tokenizer uses word separators to build the index, then removes endings to store stems, again depending on the language, English by default.
So in order to find "space1duplicate" by the beginning of the word, you can use the "autocomplete" analyzer with nGram tokens. The index should be created as follows:
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "name": {
        "tokenization": "nGram",
        "type": "autocomplete"
      }
    }
  },
  "storedSource": {
    "include": [
      "name"
    ]
  }
}
Once it's indexed (you may need to wait a bit if you have a larger dataset), you can find the document with the following search:
{
  index: 'search_index',
  compound: {
    must: [
      {
        autocomplete: {
          query: 'spa',
          path: 'name'
        }
      }
    ]
  }
}
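For reference, the filter snippets above are the body of a $search stage, so the full call runs through the aggregation pipeline. A minimal sketch, assuming a collection named coll (the actual collection name is not given in the question):
db.coll.aggregate([
  {
    $search: {
      index: 'search_index',
      compound: {
        must: [
          {
            autocomplete: {
              query: 'spa',
              path: 'name'
            }
          }
        ]
      }
    }
  }
])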

Tricky MongoDB search challenge

I have a tricky MongoDB problem that I have never encountered before.
The Documents:
The documents in my collection have a search object containing named keys and array values. The keys are named after one of eight categories and the corresponding value is an array containing items from that category.
{
  _id: "bRtjhGNQ3eNqTiKWa",
  /* */
  search: {
    usage: ["accounting"],
    test: ["knowledgetest", "feedback"]
  },
  test: {
    type: "list",
    vals: [
      { name: 'knowledgetest', showName: 'Wissenstest' },
      { name: 'feedback', showName: '360 Feedback' },
    ]
  },
  usage: {
    type: "list",
    vals: [
      { name: 'accounting', showName: 'Accounting' },
    ]
  }
},
{
  _id: "7bgvegeKZNXkKzuXs",
  /* */
  search: {
    usage: ["recruiting"],
    test: ["intelligence", "feedback"]
  },
  test: {
    type: "list",
    vals: [
      { name: 'intelligence', showName: 'Intelligenztest' },
      { name: 'feedback', showName: '360 Feedback' },
    ]
  },
  usage: {
    type: "list",
    vals: [
      { name: 'recruiting', showName: 'Recruiting' },
    ]
  }
},
The Query
The query is an object containing the same category keys and array values.
{
  usage: ["accounting", "assessment"],
  test: ["feedback"]
}
The desired outcome
If the query is empty, I want all documents.
If the query has one category and any number of items, I want all the documents that have all of the items in the specified category.
If the query has more than one category, I want all the documents that have all of the items in all of the specified categories.
My tries
I tried all kinds of variations of:
XX.find({
  'search': {
    "$elemMatch": {
      'tool': {
        "$in": ['feedback']
      }
    }
  }
})
No success.
EDIT
Tried 'search.test': { $all: (query.test ? query.test : []) }, which gives me no results if I have nothing selected; the right documents when I am only looking inside the test category; and nothing when I additionally look inside the usage category.
This is at the heart of my app, which is why I put up a bounty.
let tools = []
const search = {}
for (var q in query) {
  if (query.hasOwnProperty(q)) {
    if (query[q]) {
      search['search.' + q] = { $all: query[q] }
    }
  }
}
if (Object.keys(query).length > 0) {
  tools = ToolsCollection.find(search).fetch()
} else {
  tools = ToolsCollection.find({}).fetch()
}
Works like a charm
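To illustrate what the loop produces (a sketch using the sample query from the question):
// query = { usage: ["accounting", "assessment"], test: ["feedback"] }
// the loop above builds:
// search = {
//   'search.usage': { $all: ["accounting", "assessment"] },
//   'search.test': { $all: ["feedback"] }
// }
Since $all requires every listed item and the top-level keys are implicitly ANDed, this matches exactly the "all items in all specified categories" requirement.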
What I already hinted at in the comment: your document structure does not support efficient and simple searching. I can only guess at the reason, but I suspect you are sticking to some relational ideas like "schemas" or "normalization" which just don't make sense for a document database.
Without digging deeper into the problem of modeling, I could imagine something like this for your case:
{
  _id: "bRtjhGNQ3eNqTiKWa",
  /* */
  search: {
    usage: ["accounting"],
    test: ["knowledgetest", "feedback"]
  },
  test: {
    "knowledgetest": {
      "showName": "Wissenstest"
    },
    "feedback": {
      "showName": "360 Feedback"
    }
  },
  usage: {
    "accounting": {
      "values": [ "knowledgetest", "feedback" ],
      "showName": "Accounting"
    }
  }
},
{
  _id: "7bgvegeKZNXkKzuXs",
  /* */
  search: {
    usage: ["recruiting"],
    test: ["intelligence", "feedback"]
  },
  test: {
    "intelligence": {
      showName: 'Intelligenztest'
    },
    "feedback": {
      showName: '360 Feedback'
    }
  },
  usage: {
    "recruiting": {
      "values": [ "intelligence", "feedback" ],
      "showName": "Recruiting"
    }
  }
}
Then, a search for "knowledgetest" and "feedback" in "accounting" would be a simple
{ "usage.accounting.values" : { $all : [ "knowledgetest", "feedback"] } }
which can easily be used multiple times in an $and condition:
{
  $and: [
    { "usage.accounting.values": { $all: [ "knowledgetest", "feedback" ] } },
    { "usage.anothercategory.values": { $all: [ "knowledgetest", "assessment" ] } }
  ]
}
Even the zero-categories case matches your search requirements, because an and-filter with none of these criteria yields {}, which is the find-everything filter expression.
Once more, to make it absolutely clear: when using Mongo, forget everything you know as "best practice" from the relational world. What you need to consider is: what are your queries, and how can your document model support these queries in an ideal way?

Exclude some text fields in MongoDB text index

I have a collection that has a text index like this:
db.mycollection.createIndex({ "$**": "text" }, { "name": "AllTextIndex" })
The collection's documents have many text data blocks. I want to exclude some of them so that I don't get results matched from the excluded blocks.
However, I don't want to have to list every field in the text index, as below, just to exclude (for example) the NotTextSearchableBlock block:
db.application.ensureIndex({
  "SearchableTextBlock1": "text",
  "SearchableTextBlock2": "text",
  "SearchableTextBlock3": "text"
})
Here is a document example:
{
  "_id": "e0832e2d-6fb3-47d8-af79-08628f1f0d84",
  "_class": "com.important.enterprise.acme",
  "fields": {
    "Organization": "testing"
  },
  "NotTextSearchableBlock": {
    "something": {
      "SomethingInside": {
        "Text": "no matcheable text"
      }
    }
  },
  "SearchableTextBlock1": {
    "someKey": "someTextCool"
  },
  "SearchableTextBlock2": {
    "_id": null,
    "fields": {
      "Status": "SomeText"
    }
  },
  "SearchableTextBlock3": {
    "SomeSubBlockArray": [
      {
        "someText": "Pepe"
      }
    ]
  }
}
To answer your question, there is no documented way to exclude certain fields from a text index. (See Text Indexes in MongoDB's documentation.)
As you already know:
When establishing a text index, you have to identify the specific text field(s) to index. The field must be a string or an array of string elements.
db.mycollection.createIndex({
  mystringfield: "text",
  mystringarray: "text"
})
You may also index all text found within a collection's documents by using the wildcard specifier $**.
db.mycollection.createIndex(
  { "$**": "text" },
  { name: "mytextindex" }
)
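Either way, the resulting index is queried through the $text operator; for example, to match the sample document on one of its searchable values (a sketch; the search value is taken from SearchableTextBlock1 above):
db.mycollection.find({ $text: { $search: "someTextCool" } })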

Elasticsearch: Find substring match

I want to perform both exact word match and partial word/substring match. For example, if I search for "men's shaver" then I should be able to find "men's shaver" in the result. But if I search for "en's shaver" then I should also be able to find "men's shaver" in the result.
I am using the following settings and mappings:
Index settings:
PUT /my_index
{
  "settings": {
    "number_of_shards": 1,
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  }
}
Mappings:
PUT /my_index/my_type/_mapping
{
  "my_type": {
    "properties": {
      "name": {
        "type": "string",
        "index_analyzer": "autocomplete",
        "search_analyzer": "standard"
      }
    }
  }
}
Insert records:
POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "name": "men's shaver" }
{ "index": { "_id": 2 }}
{ "name": "women's shaver" }
Query:
1. To search by exact phrase match --> "men's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "men's"
    }
  }
}
The above query returns "men's shaver" in the result.
2. To search by partial word match --> "en's"
POST /my_index/my_type/_search
{
  "query": {
    "match": {
      "name": "en's"
    }
  }
}
The above query DOES NOT return anything.
I have also tried the following query:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "%en's%"
      }
    }
  }
}
Still not getting anything.
I figured it is because of the "edge_ngram" type filter on the index, which is not able to find a partial word/substring match.
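This can be verified with the _analyze API (a sketch in the newer request-body syntax; very old versions take these as query-string parameters):
GET /my_index/_analyze
{
  "analyzer": "autocomplete",
  "text": "men's shaver"
}
The returned tokens (m, me, men, men', men's, s, sh, sha, ..., shaver) are all anchored at the start of a token, which is why "en's" is never produced.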
I tried "n-gram" type filter as well but it is slowing down the search alot.
Please suggest me how to achieve both excact phrase match and partial phrase match using same index setting.
To search for partial field matches and exact matches, it will work better if you define the fields as "not analyzed" or as keywords (rather than text), then use a wildcard query.
To use a wildcard query, append * on both ends of the string you are searching for:
POST /my_index/my_type/_search
{
  "query": {
    "wildcard": {
      "name": {
        "value": "*en's*"
      }
    }
  }
}
To use with case insensitivity, use a custom analyzer with a lowercase filter and keyword tokenizer.
Custom Analyzer:
"custom_analyzer": {
"tokenizer": "keyword",
"filter": ["lowercase"]
}
Make the search string lowercase as well: if you get AsD as the search string, change it to *asd*.
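Put together, the index definition might look like this (a minimal sketch in current 7.x syntax; the question's string/index_analyzer mapping style predates it):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_analyzer": {
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "custom_analyzer"
      }
    }
  }
}
With the keyword tokenizer, each value is indexed as a single lowercased token, so a lowercased *en's* wildcard can match anywhere inside it.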
The answer given by @BlackPOP will work, but it uses the wildcard approach, which is not preferred as it has performance issues and, if abused, can create a huge domino effect in the Elastic cluster.
I have written a detailed blog on partial search/autocomplete, covering the latest options available in Elasticsearch as of today (Dec 2020) with performance in mind.
IMHO a better approach is to use a customized n-gram tokenizer according to the use case; the index will then already contain the tokens needed for the search term, so searching is faster. It does produce a bigger index, but storage is not that costly, and you get more control over how exactly substring search should work. The size can also be contained by being conservative when defining the min and max gram in the tokenizer settings; see the sketch below.
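As a sketch of that idea (the analyzer name and gram sizes are illustrative, not from the original answer):
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "substring_analyzer": {
          "tokenizer": "substring_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "substring_tokenizer": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  }
}
A plain match query for a substring then hits precomputed tokens instead of scanning the term dictionary the way a wildcard does.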
To search with any string or substring, use:
query: {
  bool: {
    should: [{
      match_phrase_prefix: {
        name: str
      }
    }, {
      match_phrase_prefix: {
        surname: str
      }
    }]
  }
}
Happy coding with Elasticsearch....

How can I find records greater than or equal to a time in MongoDB?

I have a MongoDB document structured like this:
{
  "_id": ObjectId("50cf904a07ef604c8cc3d091"),
  "lessons": {
    "0": {
      "lesson_name": "View and Edit Lists",
      "release_time": ISODate("2012-12-17T00:00:00Z"),
      "requires_anim": false,
      "requires_qq": true
    },
    "1": {
      "lesson_name": "Leave a Tip",
      "release_time": ISODate("2012-12-18T00:00:00Z"),
      "requires_anim": false,
      "requires_qq": true
    }
  }
}
I have a number of such documents. I'd like to get all documents for which the release time of a lesson is greater than or equal to a given time. Here's the query I wrote:
db.lessons.find({"lessons.release_time":{"$gte": ISODate("2012-12-16")}});
But this is not returning any documents. Any ideas on what I'm doing wrong and how to correct it? Thanks.
Here's the result of my testing:
> db.testc.insert({ lessons: [
    { release_time: ISODate("2012-12-17T00:00:00Z") },
    { release_time: ISODate("2012-12-18T00:00:00Z") }
  ] })
> db.testc.find({ "lessons.release_time": { "$gte": ISODate("2012-12-16") } })
{ "_id": ObjectId("50cfa093ab08a4592c73f927"),
  "lessons": [
    { "release_time": ISODate("2012-12-17T00:00:00Z") },
    { "release_time": ISODate("2012-12-18T00:00:00Z") }
  ] }
Your query is fine but, as others have pointed out, most likely your data is not structured as an array.
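If restructuring into an array is not an option, the numeric keys have to be addressed explicitly; a sketch matching the two keys shown in the question's document:
db.lessons.find({
  $or: [
    { "lessons.0.release_time": { $gte: ISODate("2012-12-16") } },
    { "lessons.1.release_time": { $gte: ISODate("2012-12-16") } }
  ]
})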