Create document clustering based on the text of the document

In Elasticsearch, is it possible to group documents that share the most similar text, without giving an initial query to compare against?
I know it is possible to query and get MLT ("more like this document") results, but is it possible to cluster the documents within an index according to a field's values?
For instance:
document 1: The quick brown fox jumps over the lazy dog
document 2: Barcelona is a great city
document 3: The fast orange fox jumps over the lazy dog
document 4: Lotus loft Room - Bear Mountains Neighbourhood
document 5: I do not like to eat fish
document 6: "Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4
document 7: Lotus Loft
Now, is there some kind of aggregation that, without being given a search query, can group them like this:
Group 1: document 1 and document 3
Group 2: document 2
Group 3: document 4 and document 6 and document 7
Group 4: document 5
OR
Please let me know other ways to cluster documents, e.g. using Apache Spark, KNN, unsupervised learning, or any other algorithm to find near-duplicate documents or cluster similar documents.
I just want to cluster my documents based on the country, city, latlng, property name, description, or other fields of my Elasticsearch documents.
Basically, I want to know:
How do I cluster similar documents (e.g. JSON/CSV) or find duplicate documents using Python text analysis/unsupervised learning with KNN, PySpark with MLlib, or any other document clustering algorithm? Please give me some hints, open-source projects, or other resource links; I just need some concrete examples or tutorials for this task.

Yes, it's possible. There is an Elasticsearch plugin named Carrot2. The clustering plugin automatically groups similar "documents" together and assigns human-readable labels to these groups, and it has 4 built-in clustering algorithms (3 free, 1 requiring a license). You can issue a match_all query if you want to cluster all documents in an ES index.
Here is my ES 6.6.2 client code example for clustering in Python 3:
import json
import requests

# Endpoint exposed by the Carrot2 clustering plugin for my index.
REQUEST_URL = 'http://localhost:9200/b2c_index/_search_with_clusters'
HEADER = {'Content-Type': 'application/json; charset=utf-8'}

requestDict = {
    # The inner search whose hits will be clustered.
    "search_request": {
        "_source": ["title", "content", "lang"],
        "query": {"match_all": {}},
        "size": 100
    },
    "query_hint": "",
    # Tell the plugin which fields hold the title, body text, and language.
    "field_mapping": {
        "title": ["_source.title"],
        "content": ["_source.content"],
        "language": ["_source.lang"]
    }
}

resp = requests.post(REQUEST_URL, data=json.dumps(requestDict), headers=HEADER)
print(resp.json())
By the way, Solr also uses Carrot2 to cluster documents.
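Since the question also asks for a plain-Python, unsupervised route, here is a minimal scikit-learn sketch (TF-IDF vectors plus k-means) over the example documents. Note that k=4 is chosen by eye to match the hoped-for grouping and would need tuning (or a density-based method such as DBSCAN) on real data:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The quick brown fox jumps over the lazy dog",
    "Barcelona is a great city",
    "The fast orange fox jumps over the lazy dog",
    "Lotus loft Room - Bear Mountains Neighbourhood",
    "I do not like to eat fish",
    '"Lotus Loft" Condo From $160.00 CAD/night, sleeps up to 4',
    "Lotus Loft",
]

# TF-IDF turns each document into a sparse vector; k-means then groups
# nearby vectors. k=4 is an assumption, not something the data tells us.
vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(vectors)

for label, doc in sorted(zip(labels, docs)):
    print(label, doc)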

Related

Storing millions of values against a key: database modelling

I want to store all tags against the documents in which they appeared and make them searchable by some other service/client. Scale:
10 billion search queries per day
10 million new tag CRUD operations per day (deleted from a doc or appended to a doc)
So suppose "hello" appeared in 10 million documents. When a user queries for "hello", I want to return the list of document_ids in which it occurred.
What should I do for the data modelling for the same?
option 1:
use a key:value NoSQL store like DynamoDB
key: "hello"
value: [doc_id1, doc_id2, .......]
Issue: whenever any document related to this tag changes, we have to read the whole value and write the change back.
option 2:
store individual rows, using something like MongoDB
"hello": doc_id1
"hello": doc_id2
Issue: suppose doc_id122 removes the "hello" tag; then we will have to fetch all entries to delete that one, as our database will be sharded on tag_name (see the sketch below for how a compound index addresses this).
option 3: column-based (e.g. Cassandra)
option 4: Elasticsearch
Additional requirements:
we want to support autosuggest on tags in our tag service.
return results according to some ranking (we can't return 1 million in the first go), so return the first 50 most popular documents (could be most viewed, most clapped). I think Elasticsearch internally gives the option to rank documents higher based on the TF-IDF algorithm.
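A minimal pymongo sketch of option 2's row-per-(tag, doc) model, to make the deletion issue concrete; the database and collection names are hypothetical. With a compound unique index on (tag, doc_id), both the lookup and the single-row delete are index hits rather than scans:

from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
tags = client.tagdb.tag_index  # hypothetical names

# One row per (tag, document); the unique compound index makes
# lookups and targeted deletes cheap.
tags.create_index([("tag", ASCENDING), ("doc_id", ASCENDING)], unique=True)

tags.insert_one({"tag": "hello", "doc_id": "doc_id1"})
tags.insert_one({"tag": "hello", "doc_id": "doc_id122"})

# Lookup: first page of documents containing the tag.
first_page = list(tags.find({"tag": "hello"}).limit(50))

# doc_id122 drops the tag: one targeted delete, no full fetch needed.
tags.delete_one({"tag": "hello", "doc_id": "doc_id122"})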

MongoDB Querying Large Datasets

Let's say I have a simple document structure like:
{
  "item": {
    "name": "Skittles",
    "category": "Candies & Snacks"
  }
}
On my search page, whenever user searches for product name, I want to have a filter options by category.
Since categories can be many (like 50 types) I cannot display all of the checkboxes on the sidebar beside the search results. I want to only show those which have products associated with it in the results. So if none of the products in search result have a category, then do not show that category option.
Now, the item search by name itself is paginated. I only show 30 items in a page. And we have tens of thousands of items in our database.
I could search and retrieve all items from all pages, then parse out the categories. But retrieving tens of thousands of items in one page would be really slow.
Is there a way to optimize this query?
You can use different approaches based on your workflow and see what works best in your situation. Some good candidates for a solution are:
Use distinct prior to running the query on the large dataset
Use an aggregation pipeline, as @Lucia suggested (see the sketch below):
[{$group: { _id: "$item.category" }}]
Use another datastore (either Redis or Mongo itself) to store intelligence on categories
Finally, based on the approach you choose and the inflow of requests for filters, you may want to consider indexing some fields.
P.S. You're right about how aggregation works: unless you have a match filter as the first stage, it will fetch all the documents and then apply the next stage.
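As a concrete illustration of the aggregation option, here is a minimal pymongo sketch; the database/collection names and the name filter are hypothetical stand-ins for your search. Putting $match first lets MongoDB use an index on item.name instead of scanning every document:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
items = client.shop.items  # hypothetical names

pipeline = [
    # First stage mirrors the user's name search so only matching
    # products are considered (and an index on item.name can be used).
    {"$match": {"item.name": {"$regex": "skittles", "$options": "i"}}},
    # Then collapse the matches down to their distinct categories.
    {"$group": {"_id": "$item.category"}},
]

categories = [doc["_id"] for doc in items.aggregate(pipeline)]
print(categories)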

How to query all documents, filter for a specific field and return the value for each document in Elasticsearch?

I'm currently running an Elasticsearch instance which is synchronizing from a MongoDB via river. The MongoDB contains entries like this:
{field1: "value1", field2: "value2", cars: ["BMW", "Ford", "Porsche"]}
Not every entry in Mongo has a cars field.
Now I want to create an Elasticsearch query that searches over every document and returns just the cars field from every single document indexed in Elasticsearch.
Is it even possible? Elasticsearch must touch every single document to return the cars field. Maybe querying Mongo directly is easier and just as fast as Elasticsearch. What do you think?
The following query POSTed to hostname:9200/_search should get you started:
{
  "filter": {
    "exists": {
      "field": "cars"
    }
  },
  "fields": ["cars"]
}
The filter clause limits the results to documents with a cars field.
The fields clause says to only return the cars field. If you wanted the entire document returned, you would leave this section out.
References:
https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#_response_filtering
Make elasticsearch only return certain fields?
Elasticsearch (from my understanding) is not intended to be a single-source-of-truth (SSoT) database. It is very good at text searching and analytics aggregations, but it isn't necessarily intended to be your primary database.
However, your use case isn't necessarily non-performant in Elasticsearch; it sounds like you just want to filter on your cars field, which you can do as documented here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-fields.html
Lastly, I would actually venture that Elasticsearch is faster than Mongo in this case (assuming that the cars field is NOT indexed in Mongo and is in Elasticsearch, which are their respective defaults), since you probably want to filter out the case in which the cars field is not set.
tl;dr Elasticsearch isn't intended for your particular use case, but it is probably faster than Mongo, assuming you filter out documents where the cars field is 'missing'.
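For reference, the top-level filter/fields syntax in the answer above is from old (1.x-era) Elasticsearch; a rough sketch of the same query against a recent cluster (7.x or later), with a hypothetical index name, might look like this:

import requests

# The exists filter keeps only documents that have a cars field;
# _source filtering returns just that field.
resp = requests.post(
    "http://localhost:9200/myindex/_search",  # hypothetical index
    json={
        "query": {"bool": {"filter": {"exists": {"field": "cars"}}}},
        "_source": ["cars"],
        "size": 100,
    },
)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["cars"])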

Data model for geo spatial data and multiple queries

I have some MongoDB object, let's call it place, which contains geo information; look at the example:
{
  "_id": "234235425e3g33424",
  "geo": {
    "lon": 12.23456,
    "lat": 34.23322
  },
  "some_field": "value"
}
With every place, a list of features is associated:
{
  "_id": "2334sgfgsr435d",
  "place_id": "234235425e3g33424",
  "feature_field": "some_value"
}
As you see, features are linked to places via the place_id field. Now I would like to find the list of features connected with the nearest places. But I would also like to add search conditions on place.some_field and feature.feature_field. And, importantly, I would like to limit the results.
Currently I use this approach:
1. I query places with conditions on geo and some_field.
2. I query features with conditions on feature_field and place_id (limited to the ones found in step 1).
3. I limit the results in my application code.
My question is: is there a better approach to this task? Right now I cannot use Mongo's limit() function: if I apply it to places I can end up with too few results, since I still need to make the second query; and I cannot limit() the second query, as its results come back in arbitrary order and I would like to sort them by distance.
I know I can put the data into one document, but I presume the list of features will be long and I could exceed the BSON size limit.
Running out of 16 MB for just the features seems unlikely... but it's possible. I don't think you realize how much 16 MB is, so do the maths before assuming anything!
In any case, with MongoDB you cannot do a query with fields from two collections: a query always deals with one specific collection only. I have done a very similar thing to what you have here, though, which I've described in an article: http://derickrethans.nl/indexing-free-tags.html — have a look at that for some more inspiration.
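To make the asker's three-step approach concrete, here is a rough pymongo sketch. It assumes a legacy 2d index on geo and hypothetical database/collection names, and it re-imposes distance order in application code, which is exactly the pain point described above:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client.geo_demo  # hypothetical database name

# Step 1: nearest places matching the place-level condition.
# Assumes db.places.create_index([("geo", "2d")]); $near returns
# results ordered by distance.
places = db.places.find(
    {"geo": {"$near": [12.23456, 34.23322]}, "some_field": "value"}
).limit(100)
place_ids = [p["_id"] for p in places]

# Step 2: features for those places, with the feature-level condition.
features = db.features.find(
    {"place_id": {"$in": place_ids}, "feature_field": "some_value"}
)

# Step 3: restore distance order (lost by $in) and apply the final limit.
rank = {pid: i for i, pid in enumerate(place_ids)}
nearest_features = sorted(features, key=lambda f: rank[f["place_id"]])[:50]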

MongoDB map reduce across 2 collections

Let's say we have user and post collections. In the post collection, the vote field stores the user names:
db.user.insert({name:'a', age:12});
db.user.insert({name:'b', age:12});
db.user.insert({name:'c', age:22});
db.user.insert({name:'d', age:22});
db.post.insert({Title:'Title1', vote:['a']});
db.post.insert({Title:'Title2', vote:['a','b']});
db.post.insert({Title:'Title3', vote:['a','b','c']});
db.post.insert({Title:'Title4', vote:['a','b','c','d']});
We would like to group by post.Title and find out the vote count for the different user ages:
> {_id:'Title1', value:{ ages:[{age:12, Count:1},{age:22, Count:0}]} }
> {_id:'Title2', value:{ ages:[{age:12, Count:2},{age:22, Count:0}]} }
> {_id:'Title3', value:{ ages:[{age:12, Count:2},{age:22, Count:1}]} }
> {_id:'Title4', value:{ ages:[{age:12, Count:2},{age:22, Count:2}]} }
I have searched around and haven't found a way to access 2 collections in a MongoDB map-reduce.
Could it be achieved with a re-reduce?
I know it is much simpler to embed the user document in post, but that is not a nice way to do it, as the real user document has many properties. If we embed a simplified version of the user document, it will limit the dimensions of the analysis.
{Title:'Title1', vote:[{name:'a', age:12}]}
MongoDB does not have multi-collection map/reduce. MongoDB has no JOIN syntax and may not be very good for ad-hoc joins. You will need to denormalize this data in some way.
You have a few options:
Option #1: Embed the age with the vote.
{Title:'Title1', vote:[{name:'a', age:12}]}
Option #2: Keep a counter of the ages:
{Title:'Title1', vote:['a', 'b'], age: { "12" : 1, "22" : 1 }}
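A rough pymongo sketch of the bookkeeping option #2 implies, recording a vote and bumping the per-age counter in one update (collection names as in the question, database name hypothetical):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").test  # hypothetical db name

# User 'c' (age 22) votes on Title2: append the vote and
# increment the matching age counter atomically in one update.
voter = db.user.find_one({"name": "c"})
db.post.update_one(
    {"Title": "Title2"},
    {
        "$push": {"vote": voter["name"]},
        "$inc": {"age.{}".format(voter["age"]): 1},
    },
)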
Option #3: Do a "manual" join
Your last option is to write a script that loops over both collections and merges the data correctly.
So you would loop over post and output a collection with the title and the list of votes. Then you would loop through the new collection and update the ages by looking up each user.
My suggestion
Go with #1 or #2.
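If you go with #1, the per-title age counts can then be computed without map-reduce at all; here is a sketch using the aggregation pipeline (pymongo, names as in the question, database name hypothetical; note that ages with zero votes simply won't appear in the output):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").test  # hypothetical db name

pipeline = [
    # One document per (post, vote), with votes shaped as in option #1,
    # e.g. {Title: 'Title3', vote: [{name: 'a', age: 12}, ...]}.
    {"$unwind": "$vote"},
    # Count votes per (title, age)...
    {"$group": {"_id": {"title": "$Title", "age": "$vote.age"},
                "count": {"$sum": 1}}},
    # ...then collect the age buckets back under each title.
    {"$group": {"_id": "$_id.title",
                "ages": {"$push": {"age": "$_id.age", "Count": "$count"}}}},
]

for doc in db.post.aggregate(pipeline):
    print(doc)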
Instead of embedding
{name:'a', age:12}
in each post, it is easier to add a new field to the user document and maintain it on each vote update. Of course, you can then still use map-reduce to analyse your data:
{name:'a', age:12, voteTitle:["Title1","Title2","Title3","Title4"]}