I have 3.5M documents, and each document has k unique identifiers. I need to cluster the documents based on their similarity. Two documents are similar if they have at least m overlapping identifiers, where m < k.
If I pick any two documents in a cluster (for cluster size > 1), they must have at least m overlapping identifiers.
What is a fast way to do this? Also, I want to minimize the number of clusters.
If I understand you correctly, you are looking for graph clustering, which is a hard problem to solve.
Here is an article about graph clustering, but you might find more/better info if you google for it.
As for "what is the fastest way to do it": it is next to impossible to answer, as you don't give any information about the dataset or your environment. However, I do suspect that loading this into a graph database, some of which have graph clustering features built in, would get you far quite quickly.
For the general procedure to solve this problem, here is some pseudo-code:
def calculate_similarity(doc1, doc2):
    # Count how many identifiers the two documents share.
    score = 0
    for identifier in doc1.identifiers:
        if identifier in doc2.identifiers:
            score += 1
    return score
from collections import defaultdict

similarity_double_hash = defaultdict(dict)
for document1 in all_documents:
    for document2 in all_documents:
        if document1 == document2:
            continue
        similarity = calculate_similarity(document1, document2)
        similarity_double_hash[document1][document2] = similarity
        similarity_double_hash[document2][document1] = similarity
Because we now have an "any-to-any" relation in the double hash, we can find any cluster that a document is in simply by looking at the m values of that document. Any two documents with the same m value will be in a cluster.
Example of one such group:
def get_groups_from_document(doc, similarity_double_hash):
    groups = defaultdict(list)
    for other_doc, sim_score in similarity_double_hash[doc].items():
        groups[sim_score].append(other_doc)  # other_doc is the other document
    return groups
The groups hash that is returned maps each value of m to the documents that are part of that group, originating from the given document. Those documents are guaranteed to have a similarity score of at least m to the originating document; it is not guaranteed to be exactly m.
If you start from another document, the same value of m can, and probably will, have other documents in its list.
If you want to get the largest cluster for a given m, then you must figure out which document to originate from to get the largest cluster. Also, a document can be part of multiple clusters. If you do not want that, then you are back at the beginning with the hard problem of graph clustering.
To find the largest groups for each given m, you can do this:
all_groups = {}
for document in all_documents:
    all_groups[document] = get_groups_from_document(document, similarity_double_hash)

max_groups = defaultdict(list)
for document, groups in all_groups.items():
    for score, document_list in groups.items():
        if len(max_groups[score]) < len(document_list):
            max_groups[score] = document_list

for score, document_list in max_groups.items():
    print("Largest group for " + str(score) + " is " + str(document_list))
Now you have a fine list of the largest groups for any given m, but as I said, documents can be in multiple lists, and an "m" group here is really "m-or-greater", not "exactly m".
I'm trying to implement a search algorithm that can search through hundreds of thousands of products and display the most relevant results.
My current process is:
1. Get the user's input and filter out prepositions and punctuation to arrive at keywords.
2. Break the keywords into an array.
3. For each of the keywords, find all the products that contain the keyword in their product description and add all of those products to a RawProductDictionary.
4. Calculate the Levenshtein distance between the keywords and each product description.
5. Create an array of products based on the Levenshtein distance number.
This question builds on top of this question:
Swift: How can the dictionary values be arranged based on each item's Levenshtein Distance number
This is my Levenshtein distance function:
func levenshteinDist(test: String, key: String) -> Int {
    // Dynamic-programming edit distance; `last` holds the previous row of the matrix.
    let empty = Array<Int>(repeating: 0, count: key.count)
    var last = [Int](0...key.count)
    for (i, testLetter) in test.enumerated() {
        var cur = [i + 1] + empty
        for (j, keyLetter) in key.enumerated() {
            cur[j + 1] = testLetter == keyLetter ? last[j] : min(last[j], last[j + 1], cur[j]) + 1
        }
        last = cur
    }
    return last.last!
}
This is the function that implements step 5
func getProductData() {
    Global.displayProductArry = []
    var pIndexVsLevNum = [String: Int]()
    for product0 in Global.RawSearchDict {
        let generatedString = product0.value.name.uppercased()
        let productIndex = product0.key
        let relevanceNum = levenshteinDist(test: generatedString, key: self.userWordSearch)
        pIndexVsLevNum[productIndex] = relevanceNum
    }
    print(pIndexVsLevNum)

    Global.displayProductArry = []
    for (k, v) in (Array(pIndexVsLevNum).sorted { $0.1 < $1.1 }) {
        print("\(k):\(v)")
        Global.displayProductArry.append(Global.RawSearchDict[k]!)
    }
}
The code works, but the products are not that relevant to the user input. The Levenshtein distance number is not always indicative of relevance, and products with shorter descriptions are usually disadvantaged and missed.
What is the best way to implement searching through hundreds of thousands of products quickly in Swift?
I believe you are looking for Full-Text Search.
You could use existing tools for that, rather than creating your own information retrieval process.
Looks like SQLite can give you that:
See: https://medium.com/flawless-app-stories/how-to-use-full-text-search-on-ios-7cc4553df0e0
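As a rough illustration of what full-text search buys you, here is a minimal sketch using SQLite's FTS5 module from Python (assuming your SQLite build ships with FTS5; the table, columns and sample rows are made up for this example). The linked article covers the same idea from iOS.

import sqlite3

conn = sqlite3.connect(":memory:")
# A virtual FTS5 table indexes its text columns for full-text search.
conn.execute("CREATE VIRTUAL TABLE products USING fts5(name, description)")
conn.executemany(
    "INSERT INTO products (name, description) VALUES (?, ?)",
    [
        ("Pancake Mix", "Buttermilk pancake mix with chocolate chips"),
        ("Salted Butter", "Creamy salted dairy butter"),
    ],
)
# MATCH runs the full-text query; bm25() ranks rows so the best match sorts first.
rows = conn.execute(
    "SELECT name FROM products WHERE products MATCH ? ORDER BY bm25(products)",
    ("chocolate OR butter OR pancakes",),
).fetchall()
print(rows)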
According to Wikipedia:
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
You should be using Levenshtein distance to compare individual words with each other, not entire product descriptions with a single word. The reason to compare individual words is to determine whether the user has made a typo and actually meant to type something else. Hence the first part of your problem is to clean up the user's query:
First, check for perfect matches against your keyword database.
For words which do not match perfectly, run Levenshtein to create a list of the most closely matching words (see the sketch below).
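A minimal Python sketch of that two-step cleanup, just to make the idea concrete (the keyword set is made up, and the clean_query helper and its max_distance cutoff are illustrative names, not part of your code):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cur[j] = prev[j - 1] if ca == cb else 1 + min(prev[j - 1], prev[j], cur[j - 1])
        prev = cur
    return prev[-1]

def clean_query(words, keyword_db, max_distance=2):
    cleaned = []
    for word in words:
        if word in keyword_db:                       # step 1: perfect match
            cleaned.append(word)
            continue
        best = min(keyword_db, key=lambda kw: levenshtein(word, kw))
        if levenshtein(word, best) <= max_distance:  # step 2: closest known keyword
            cleaned.append(best)
    return cleaned

keywords = {"chocolate", "butter", "pancakes"}
print(clean_query(["choclate", "buter", "pancakes"], keywords))
# ['chocolate', 'butter', 'pancakes']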
Let's step back for a moment and look at the big picture:
Simply using Levenshtein distance by itself is not the best way to determine the most relevant product, because a product description is normally much, much larger than a user's query and describes a variety of features. Let us assume that the words are correctly spelled and forget spell checking for a moment, so we can focus on relevancy.
You will have to use a combination of techniques to determine which is the most relevant document to display:
First, create a tf-idf database to determine how important each word is in a product description. Words like "and", "is", "the", etc. are very common and usually will not help you determine which document is most relevant to a user's query.
The longer a product description is, the more often a word is likely to occur. This is why we need to compute the inverse document frequency: to determine how rare a word is across the entire database of documents.
By creating a tf-idf database, you can rank the most important words in a product description, as well as determine how common a word is across all documents. This will help you assign a weight to the value of each word.
A high weight in tf-idf is reached by a high term frequency in a given document, and a low document frequency of the term in the whole collection of documents.
Hence, for each word in a query, you must compute the relevancy score for all documents in your product description database. This should ideally be done in advance, so that you can quickly retrieve results. There are multiple ways to compute TF-IDF, so pick one option you are comfortable with and compute TF-IDF for every unique word in your documents.
Now, how will you use TF-IDF to produce relevant results?
Here is an example:
Query: "Chocolate Butter Pancakes"
You should have already computed the TF and IDF for each of the three words in the query. A simplistic formula for computing relevance is:
Simplistic Product Description Score: TF-IDF(Chocolate) + TF-IDF(Butter) + TF-IDF(Pancakes)
Compute the Product Description score for every single product description (for the words in the query), and sort the results from highest score to lowest score to get the most relevant result.
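Here is a minimal sketch of that scoring scheme in Python, using scikit-learn's TfidfVectorizer as one possible way to build the TF-IDF weights (the product descriptions are made up, and a real system would precompute the matrix ahead of time rather than per query):

from sklearn.feature_extraction.text import TfidfVectorizer

descriptions = [
    "Fluffy buttermilk pancakes with rich chocolate chips",
    "Salted butter churned from fresh cream",
    "Dark chocolate bar with sea salt",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(descriptions)   # rows: descriptions, columns: vocabulary terms
vocab = vectorizer.vocabulary_

def description_score(doc_index, query_words):
    # Simplistic score: sum of the TF-IDF weights of the query words in this description.
    return sum(tfidf[doc_index, vocab[w]] for w in query_words if w in vocab)

query = ["chocolate", "butter", "pancakes"]
ranked = sorted(
    ((description_score(i, query), desc) for i, desc in enumerate(descriptions)),
    reverse=True,
)
for score, desc in ranked:
    print(round(float(score), 3), desc)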
The above example is a very simple explanation of how to compute relevancy, since the question you asked is actually a huge topic. To improve relevancy, you would have to do several additional things:
Stemming, lemmatization and other text normalization techniques prior to computing the TF-IDF of your product descriptions (a short sketch follows this list).
Likewise, you may need to do the same for your search queries.
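For instance, a quick sketch of stemming with NLTK's Porter stemmer, applied to both descriptions and queries before the TF-IDF step (assuming NLTK is installed; any stemmer or lemmatizer would serve the same purpose, and the normalize helper is just an illustrative name):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def normalize(text):
    # Lowercase, split on whitespace, and stem each token so that
    # variants like "pancake"/"pancakes" collapse to one term.
    return [stemmer.stem(token) for token in text.lower().split()]

print(normalize("Chocolate Butter Pancakes"))
# e.g. ['chocol', 'butter', 'pancak']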
As you can imagine, the above algorithm to provide sorted relevant results would perform poorly if you have a large database of product descriptions. To improve performance, you may have to do a number of things:
Cache the results of previous queries. If new products are not added/removed and the product descriptions do not change often, this becomes much easier.
If descriptions change, or products are added/removed, you need to compute TF-IDF for the entire database again to get the most relevant results. You will also need to trash your previous cache and cache the new results instead. This means that you would have to periodically recompute TF-IDF for your entire database, depending on how often it is updated.
As you can see, even this simple example is already starting to get complicated to implement, and we haven't even started talking about more advanced techniques in natural language processing, even something as simple as handling synonyms in a document.
Hence this question is simply too broad for anyone to provide a complete answer on Stack Overflow.
Rather than implementing a solution yourself, I would recommend searching for a ready-made solution and incorporating it into your project instead. Search is a common feature nowadays, and since there are many solutions available for different platforms, perhaps you could offload search to a web service, so you are not limited to Swift and can use a ready-made solution like Solr, Lucene, Elasticsearch, etc.
I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does using one query (or a minimal amount of queries), and I need this list to be random every time. I need this to be efficient -- relatively on par with such a query with MySQL. I want to reproduce the following but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method of adding a field to each document, generating a random value for that field, and using {rand: {$gte: rand()}} in each query we want randomized. But my concern is that two queries could theoretically return the same set.
You may do two requests, but in an efficient way:
Your first request just gets the list of all "_id"s of the documents in your collection. Be sure to use a Mongo projection: db.products.find({}, { '_id' : 1 }).
Once you have the list of "_id"s, just pick N of them randomly from the list.
Do a second query using the $in operator.
What is especially important is that the first query is fully covered by an index (because it's on "_id"). This index is likely fully in memory (otherwise you'd probably have performance problems), so only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading actual documents, the index will help a lot.
If you can do things this way, you should try it.
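A rough PyMongo sketch of that two-request approach (the connection, database name and sample size are assumptions for the example):

import random
from pymongo import MongoClient

db = MongoClient()["mydb"]   # assumed connection and database name
n = 50

# Request 1: fetch only the _id values; this is covered by the _id index.
ids = [doc["_id"] for doc in db.products.find({}, {"_id": 1})]

# Pick N ids at random on the application side.
random_ids = random.sample(ids, min(n, len(ids)))

# Request 2: fetch the corresponding documents with $in.
random_docs = list(db.products.find({"_id": {"$in": random_ids}}))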
I don't think MySQL ORDER BY rand() is particularly efficient - as I understand it, it essentially assigns a random number to each row, then sorts the table on this random number column and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document will be assigned a unique positive integer, sequentially. It doesn't matter which document gets which number, as long as the assignment is unique and the numbers are sequential, and you either don't delete documents or you complicate the counter document scheme to handle holes.
You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment it atomically. Then insert the new document with the counter value.
To find N random values, find the max counter value, generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages should have random libraries that will handle generating the N random integers in a range.
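Here is a hedged PyMongo sketch of that counter scheme (the database, collection and field names are made up, and error handling plus the "holes" problem are left out):

import random
from pymongo import MongoClient, ReturnDocument

db = MongoClient()["mydb"]   # assumed database name

def insert_with_counter(doc):
    # Step 1: atomically reserve the next counter value (findAndModify).
    counter = db.counters.find_one_and_update(
        {"_id": "product_counter"},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    # Step 2: insert the document carrying its unique, sequential number.
    doc["counter"] = counter["seq"]
    db.products.insert_one(doc)

def sample_random(n):
    # Find the max counter, draw N distinct integers in its range,
    # then fetch the matching documents with $in.
    top = db.products.find_one(sort=[("counter", -1)])
    picks = random.sample(range(1, top["counter"] + 1), n)
    return list(db.products.find({"counter": {"$in": picks}}))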
I have a collection which I'm aggregating and grouping by the field "type". The final result should contain at most five documents per type. But if I limit before the group, only the first five docs will be grouped; if I limit after the group, only the first five types will be returned.
Is there a way to do this without doing a find() for each type, limiting to 5, and merging all the results?
If you can use C# (which, according to my quick Google search about MongoDB, you can), you can do this with one of the GroupBy overloads that takes a result-selector function, like this:
var groups = Enumerable.Range(0, 1000)
    .GroupBy(
        x => x / 10,
        (key, elements) => new { Key = key, Elements = elements.Take(5) }
    );
About the speed of this code: I believe the group is completely built before the result selector is invoked, so a custom foreach over the input sequence that builds the groups by hand might be faster (if you can somehow determine when you are done); see the sketch below.
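For comparison, a minimal Python sketch of that hand-rolled foreach, which simply stops adding to a group once it already holds five elements (the input data is made up):

from collections import defaultdict

docs = [{"type": i % 3, "value": i} for i in range(1000)]   # made-up input

groups = defaultdict(list)
for doc in docs:
    bucket = groups[doc["type"]]
    if len(bucket) < 5:          # keep at most five documents per type
        bucket.append(doc)

for key, elements in groups.items():
    print(key, [d["value"] for d in elements])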
P.S.: On second thought, I doubt my answer is the one you want. I had a look at the MongoDB documentation, and "map" in combination with a suitable "reduce" function might be exactly what you want.
I have a collection that will have many documents (maybe millions). When a user inserts a new document, I would like to have a field that maintains the "order" of the data and that I can index. For example, if one field is time, in the format "1352392957.46516", and I have three documents, the first with time 1352392957.46516, the second with time 1352392957.48516 (20 ms later), and the third with 1352392957.49516 (10 ms later), I would like to have another field where the first document has 0, the second 1, the third 2, and so on.
The reason I want this is so that I can index that field, and then when I do a find I can do an efficient $mod operation to downsample the data. For example, if I have a million docs and I only want 1000 of them evenly spaced, I could do a $mod [1000, 0] on the integer field.
The reason I cannot do that on the time field is that the times may not be perfectly spaced, or might be all even or all odd, so the mod would not work. The separate integer field would keep the order increasing linearly.
Also, you should be able to insert documents anywhere in the collection, so all subsequent fields would need to be updated.
Is there a way to do this automatically? Or would I have to implement this? Or is there a more efficient way of doing what I am describing?
It is well beyond "slower inserts" if you are updating several million documents for a single insert - this approach makes your entire collection the active working set. Similarly, in order to do the $mod comparison with a key value, you will have to compare every key value in the index.
Given your requirement for a sorted sampling order, I'm not sure there is a more efficient preaggregation approach you can take.
I would use skip() and limit() to fetch the sampled documents. The skip() command will be scanning from the beginning of the index to skip over unwanted documents each time, but if you have enough RAM to keep the index in memory the performance should be acceptable:
// Add an index on the time field
db.data.ensureIndex({'time': 1})

// Count the number of documents
var dc = db.data.count()

// Iterate and sample every 1000 docs
var i = 0; var sampleSize = 1000; var results = [];
while (i < dc) {
    results.push(db.data.find().sort({time: 1}).skip(i).limit(1)[0]);
    i += sampleSize;
}

// Result array of sampled docs
printjson(results);
Is there a preferred way to query mongo with a limit and know whether there will be more results if I query for the next page with skip/limit?
What I have been doing is asking for one more document than I need, slicing it off the end, and using the existence of that extra document to know whether another query will give at least one more result.
n = 10
docs = db.documents.find({'foo': 'bar'}).limit(n+1)
more = docs.count() > n
docs = docs[:n]
I feel like this is a common use case (knowing whether or not to show a "more results" button on a website), and I feel stupid with my current solution.
MongoDB has tailable cursors, which allow you to reuse a cursor after all data has been returned. It would look something like this:
n = 10
docs = db.documents.find({"foo": "bar"}).limit(n)
more = docs.hasNext()
Note that there are only 10 documents retrieved, but the cursor can be inspected to determine if there are more objects available. The only problem is that tailable cursors can only be used on capped collections.
The above can also be used with a regular cursor, but you'd have to query for n + 1 documents. This is basically the same solution as you're using now. You have to use size() though, as that takes the skip and limit modifiers into account.
n = 10
docs = db.documents.find({"foo": "bar"}).limit(n + 1)
more = docs.size() > n
I'm not familiar with PyMongo, so I don't know this for sure, but there's a possibility that this solution sends n + 1 full documents to your application, rather than the required n, resulting in minor bandwidth overhead. If that's the case, you may want to create a server-side function that does the same, but only returns an object containing n documents in an array, and a flag that indicates if an n + 1th document is available.
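For example, a small helper along those lines, here sketched on the client side in Python just to show the shape of returning n documents plus a flag (the fetch_page name and parameters are made up; a true server-side function would additionally avoid shipping the extra document over the wire):

def fetch_page(collection, query, n, skip=0):
    # Ask for one extra document; its presence tells us whether a next page exists.
    docs = list(collection.find(query).skip(skip).limit(n + 1))
    has_more = len(docs) > n
    return docs[:n], has_more

# Usage:
# page, more = fetch_page(db.documents, {"foo": "bar"}, 10)
# "more" tells you whether to show the "more results" button.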