Best solution for finding 1 x 1 million set intersection? Redis, Mongo, other - mongodb

Hi all and thanks in advance.
I am new to the NoSQL game but my current place of employment has tasked me with set comparisons of some big data.
Our system has customer tag sets and targeted tag sets.
A tag is an 8-digit number.
A customer tag set may have up to 300 tags but averages 100 tags.
A targeted tag set may have up to 300 tags but averages 40 tags.
Pre-calculating is not an option, as we are shooting for a potential customer base of a billion users.
(These tags are hierarchical so having one tag implies that you also have its parent and ancestor tags. Put that info aside for the moment.)
When a customer hits our site, we need to intersect their tag set against one million targeted tag sets as fast as possible. The customer set must contain all elements of the targeted set to match.
I have been exploring my options, and the set intersection in Redis seems like it would be ideal. However, my trawling through the internet has not revealed how much RAM would be required to hold one million tag sets. I realize the intersection would be lightning fast, but is this a feasible solution with Redis?
I realize this is brute force and inefficient. I also wanted to use this question as a means to get suggestions for ways this type of problem has been handled in the past. As stated before, the tags are stored in a tree. I have begun looking at MongoDB as a possible solution as well.
Thanks again

This is an interesting problem, and I think Redis can help here.
Redis can store sets of integers using an optimized "intset" format. See http://redis.io/topics/memory-optimization for more information.
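A quick way to verify that encoding on a populated key (using redis-py here; the client choice is just for illustration):
import redis

r = redis.Redis(decode_responses=True)
r.sadd("example:intset", 10000001, 10000002, 10000003)
# Small, all-integer sets should report the compact "intset" encoding
print(r.object("encoding", "example:intset"))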
I believe the correct data structure here is a collection of targeted tag sets, plus a reverse index to map tags to their targeted tag sets.
To store two targeted tag sets:
0 -> [ 1 2 3 4 5 6 7 8 ]
1 -> [ 2 6 7 8 9 10 ]
I would use:
# Targeted tag sets
sadd tgt:0 1 2 3 4 5 6 7 8
sadd tgt:1 2 6 7 8 9 10
# Reverse index
sadd tag:1 0
sadd tag:2 0 1
sadd tag:3 0
sadd tag:4 0
sadd tag:5 0
sadd tag:6 0 1
sadd tag:7 0 1
sadd tag:8 0 1
sadd tag:9 1
sadd tag:10 1
This reverse index is quite easy to maintain when targeted tag sets are added/removed from the system.
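For what it's worth, here is a minimal redis-py sketch of that maintenance (the tgt:<id> and tag:<id> key names follow the commands above; pipelining is just one reasonable way to batch the writes):
import redis

r = redis.Redis(decode_responses=True)

def add_targeted_set(set_id, tags):
    pipe = r.pipeline()
    pipe.sadd(f"tgt:{set_id}", *tags)       # forward: set -> tags
    for t in tags:
        pipe.sadd(f"tag:{t}", set_id)       # reverse: tag -> sets
    pipe.execute()

def remove_targeted_set(set_id):
    tags = r.smembers(f"tgt:{set_id}")
    pipe = r.pipeline()
    for t in tags:
        pipe.srem(f"tag:{t}", set_id)       # clean up the reverse index
    pipe.delete(f"tgt:{set_id}")
    pipe.execute()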
The global memory consumption depends on the number of tags which are common to multiple targeted tag sets. It is quite easy to store pseudo-data in Redis and simulate the memory consumption. I have done it using a simple node.js script.
For 1 million targeted tag sets (tags being 8-digit numbers, 40 tags per set), the memory consumption is close to 4 GB when there are very few tags shared by the targeted tag sets (more than 32M entries in the reverse index), and about 500 MB when the tags are shared a lot (only 100K entries in the reverse index).
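If you want to reproduce that kind of estimate, a rough Python sketch of the idea (the original was a node.js script; the parameters below match the numbers assumed above, and shrinking TAG_SPACE forces more sharing):
import random
import redis

r = redis.Redis(decode_responses=True)
NUM_SETS = 1_000_000      # targeted tag sets
TAGS_PER_SET = 40         # average set size
TAG_SPACE = 10 ** 8       # 8-digit tags

pipe = r.pipeline(transaction=False)
for set_id in range(NUM_SETS):
    tags = random.sample(range(TAG_SPACE), TAGS_PER_SET)
    pipe.sadd(f"tgt:{set_id}", *tags)
    for t in tags:
        pipe.sadd(f"tag:{t}", set_id)
    if set_id % 1000 == 999:
        pipe.execute()
pipe.execute()

print(r.info("memory")["used_memory_human"])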
With this data structure, finding the targeted tag sets containing all the tags of a given customer is extremely efficient.
1- Get customer tag set (suppose it is 1 2 3 4)
2- SINTER tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having all the tags of the customer
The intersection operation is efficient because Redis is smart enough to order the sets by cardinality and start with the set having the lowest cardinality.
Now I understand you need to implement the converse operation (i.e. finding the targeted tag sets having all their tags in the customer tag set). The reverse index can still help.
Here is an example in ugly pseudo-code:
1- Get customer tag set (suppose it is 1 2 3 4)
2- SUNIONSTORE tmp tag:1 tag:2 tag:3 tag:4
=> result is a list of targeted tag sets having at least one tag in common with the customer
3- For t in tmp (iterating on the selected targeted tag sets)
n = SCARD tgt:t (cardinality of the targeted tag set)
intersect = SINTER customer tgt:t
if n == len(intersect), this targeted tag set matches
So you never have to test the customer tag set against 1M targeted tag sets. You can rely on the reverse index to restrict the scope of the search to an acceptable level.
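Here is the same logic as a minimal redis-py sketch (the "customer" key name is made up; in practice you would probably use a per-request temporary key and clean it up afterwards):
import redis

r = redis.Redis(decode_responses=True)

def matching_targeted_sets(customer_tags):
    r.delete("customer", "tmp")
    r.sadd("customer", *customer_tags)
    # Candidates: targeted sets sharing at least one tag with the customer
    r.sunionstore("tmp", [f"tag:{t}" for t in customer_tags])
    matches = []
    for t in r.smembers("tmp"):
        n = r.scard(f"tgt:{t}")
        intersect = r.sinter("customer", f"tgt:{t}")
        if n == len(intersect):   # every tag of tgt:t is in the customer set
            matches.append(t)
    return matches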

This might be helpful:
Case Study: Using Redis intersect on very large sets (120M+ with 120M+)
http://redis4you.com/articles.php?id=016&name=Case+Study%3A+Using+Redis+intersect+on+very+large+sets

The answers provided helped me initially. However, as our customer base grew, I stumbled across a great technique that uses Redis string bits and bit operators to perform analytics on hundreds of millions of users very quickly.
Check out this article; antirez, the creator of Redis, also references it a lot.
http://blog.getspool.com/2011/11/29/fast-easy-realtime-metrics-using-redis-bitmaps/
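For context, the core of that technique looks roughly like this (a sketch only; the key names are invented for illustration, while the article itself works with daily active-user keys):
import redis

r = redis.Redis()

def record_active(user_id, day):
    r.setbit(f"active:{day}", user_id, 1)      # one bit per user id

def daily_actives(day):
    return r.bitcount(f"active:{day}")         # how many bits are set

def active_on_both(day1, day2):
    # Intersection of two user populations via a bitwise AND
    r.bitop("AND", "tmp:both", f"active:{day1}", f"active:{day2}")
    return r.bitcount("tmp:both")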

Related

Assign weight to attribute based on whether its value is greater or less than some value: algolia

Hi, I am trying to find a search solution where I can assign a weight (X points) to an attribute if its value is greater or smaller than some value (Y).
For example, if the price is greater than 10 USD, assign 5 points to the item. I am assigning points on multiple attributes and then want to get the list of items ordered by total points, ascending or descending. How can I do this in Algolia?
Algolia doesn't work with weights, but with a tie-breaking strategy that decides how to rank results based on their attributes. This strategy is static, and set at indexing time.
In your case, you want to rank results by a multitude of criteria, including price. The easiest way to do this is to use the customRanking attribute and set each attribute that should play a role in the ranking strategy. For example, if you want more expensive items to be ranked higher, you can do the following (JavaScript example, but you have a choice between 11 different languages):
index.setSettings({
  customRanking: [
    'desc(price)'
  ]
});
Notice the customRanking property takes an array. You can pass several criteria for your custom ranking, and they will be taken into account in the defined order, if the engine can't break the tie.
Since you're working with prices, you may end up in a case where two prices are so close that it makes no sense to break the tie on them; and you'll want to move on to the next criterion. In this case, you can add a new attribute with a rounded price and use this one as the custom ranking attribute. There's a guide in the documentation on that topic.
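As a sketch of that rounded-price approach (shown here with the Python client in algoliasearch v2 style, but any of the clients works the same way; the index name, attribute name, and bucket size are assumptions):
from algoliasearch.search_client import SearchClient

client = SearchClient.create("YourAppID", "YourAdminAPIKey")
index = client.init_index("products")

# Store a bucketed copy of the price alongside the exact one; both of
# these records fall in the same bucket, so the tie moves to the next criterion
records = [
    {"objectID": "1", "price": 10.49, "rounded_price": 10},
    {"objectID": "2", "price": 10.51, "rounded_price": 10},
]
index.save_objects(records)

# Break ties on the bucketed value instead of the raw price
index.set_settings({
    "customRanking": ["desc(rounded_price)"]
})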

Defining relevant indices for database indexing

I need to define and create indices for a PostgreSQL DB used for translation memory.
This is related to this (Database design question regarding performance) question I've posted, and the oversimplified design follows this (How to design a database for translation dictionary?) answer. The only difference is that I have a Segment (basically a sentence instead of a word).
Tables:
I. languages
ID   NAME
---------------
1    English
2    Slovenian
II. segments
ID   CONTENT        LANGUAGE_ID
-------------------------------
1    Hello World    1
2    Zdravo, svet   2
III. translation_records (TranslationRecord has more columns, omitted here, like domain, user etc.)
ID   SOURCE_SEGMENT_ID   TARGET_SEGMENT_ID
------------------------------------------
1    1                   2
I want to index the segments table for when searching existing translations and for when searching combination of words in the DB.
My question is, is it enough to create an index for the segments table for the CONTENT column or should I also tokenize the CONTENT column to a new column TOKENS and index that as well?
Also, am I missing something else that might be important for creating such indices?
---EDIT---
Querying examples:
When a user enters a new text to translate, the app returns predefined number of existing translation records where source segment's content matches by a certain percent with the entered text.
When a user triggers a manual query, the app lists a predefined number of existing translation records where the source segment's content includes the words marked by the user (i.e. the concordance search).
Since there is only one table for all language combinations the first condition for querying would be the language_combination (attribute of translation_record).
---EDIT---

Plotting frequencies in Seaborn

I'm looking at an SNL dataset and I want to use seaborn to take a look at a couple different things.
I'm using this to learn more about visualizations in jupyter (aka I'm a beginner).
The data set looks like this:
aid: actor
capacity: what their role was
charid: unique character id
impid: unique impersonation id
role: name of role they played
tid: sketch id
voice: were they just a voiceover?
epid: episode id
sid: season id
Some questions:
Who are the top 20 actors who appeared on SNL?
Which characters were used most frequently?
Which impressions were used most frequently?
Which characters were played by multiple actors?
I tried this, but it includes so many people that I want to limit it to maybe 20. Or if you have suggestions of other visualizations to try, I'm all ears.
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(5,5))
sns.countplot(y="aid", data=appearances);
Some example plots of how to answer some of these questions would be amazing!!
Your question is quite broad but in general, for each series, you can do this:
Get the count for each unique element of a specific column and only take the 20 elements with the highest count:
top20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values().tail(20)
bot20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values().head(20)
NB: sort_values, by default, sorts in ascending order, hence to find the values with the highest count we use tail(). You can sort in descending order instead by using .sort_values(ascending=False); in that case you would select the elements with the highest count using .head(), e.g.
top20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values(ascending=False).head(20)
bot20aid=appearances.groupby(['aid'],sort=False)['aid'].count().sort_values(ascending=False).tail(20)
Then simply plot the results in a barplot
sns.barplot(x=top20aid.values, y=top20aid.index)
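Putting it together (this assumes the usual imports and the appearances DataFrame from the question; the labels are just suggestions):
import matplotlib.pyplot as plt
import seaborn as sns

top20aid = appearances['aid'].value_counts().head(20)

plt.figure(figsize=(8, 6))
sns.barplot(x=top20aid.values, y=top20aid.index)
plt.xlabel('Number of appearances')
plt.ylabel('Actor (aid)')
plt.tight_layout()
plt.show()

# countplot can also be limited to the top 20 directly:
# sns.countplot(y='aid', data=appearances,
#               order=appearances['aid'].value_counts().index[:20])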

recover sort order/position values using magmi with multiple website/store/storeviews

I've been using Magmi with great success, creating and updating our magento products on a daily basis.
Our production retail site generally uses the default/admin values for store. When I make new categories and populate them I generally use the category_reset=0 column to preserve the handmade sort order or position values for all of the original categories.
I've been working on a wholesale site set up with a separate filesystem for all 3 levels of the Magento hierarchy. I did an import with Magmi, setting the store column to the wholesale site, with 2 additional columns - sku and category_ids (without category_reset) - using a subset of data exported from the admin store view (filtered on the manufacturer column for only one manufacturer) to try to populate the wholesale site categories (same root catalog with certain categories disabled or not visible) with the same category products.
For some reason (ouch, I realize now there was a typo in the header name for store) it did not update the right store - it defaulted back to admin and lost the sort order for many categories; about 3k products imported OK.
I have 2 non-production sandbox sites with duplicate category data. I've been manually copying the category product listings with the desired position values into a new csv so I will have sku,category_id (singular),position_value
Many products are members of more than one category. My question is...
In order to regain the position values or sort order, what syntax should I use under category_ids? The products are already in the category so I would use a category_reset=0 column, right?
for an example record:
sku category_ids
45000 39,262,353
my next import might look like:
sku category_ids category_reset
abc 39::10 0
def 39::20 0
45000 39::30 0
ghi 262::10 0
45000 262::20 0
jkl 262::30 0
45000 353::10 0
mno 353::20 0
does this seem workable? I'm feeling very gunshy after having borked my production site with a typo and need some validation before I take steps to confuse myself further.
Thanks in advance for any insight.
As stated in the Magmi Documentation for Importing item positions in categories (from magmi version 0.7.18), the syntax is as follows:
sku,....,category_ids
000001,...,"8::1" <= put sku 000001 at position 1 in category with id 8
000002,...,"9::4,7" <= put sku 000002 at position 4 in category with id 9 and at position 0 in category with id 7
000003,...,"8::10" <= put sku 000003 at position 10 in category with id 8
So yes, your method should work. Be sure to do a full database backup before doing major import changes ;)
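If you end up generating that CSV programmatically, here is a tiny sketch (hypothetical data; one row per sku, with its categories and positions joined in the category_ids cell using the cat::position syntax from the docs):
import csv

# sku -> {category_id: position}; values are made up for illustration
positions = {
    "45000": {39: 30, 262: 20, 353: 10},
    "abc": {39: 10},
}

with open("position_update.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sku", "category_ids", "category_reset"])
    for sku, cats in positions.items():
        category_ids = ",".join(f"{cat}::{pos}" for cat, pos in cats.items())
        writer.writerow([sku, category_ids, 0])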

Efficiently updating cosine similarity scores

My iPhone application is using a SQLite database with the following schema:
items(id, name, ...) -> this table contains 50 records
tags(id, name) -> this table contains 50 records
item_tags(id, item_id, tag_id, user_id)
similarities(id, item1_id, item2_id, score)
The items, tags, item_tags and similarities tables are populated with pre-defined records; hence the similarities between different items have also been calculated offline (using the cosine similarity algorithm, based on the items' tags).
Users are able to add additional tags to items and to remove their custom tags later on. Whenever this happens the similarity scores between the items should be updated locally, i.e. without contacting the server application.
My question now is the following:
What is the most efficient way to do so? So far, on startup of the iPhone application, I compute a term-document matrix for all the items and tags (which reflects the tag frequencies for each item) and keep this matrix in memory as long as the application is running. Whenever a tag is added or removed, I use this matrix to update the similarities in the database. However, this is rather inefficient. Do you have any suggestions?
Thanks!
This presentation might help you:
http://www.slideshare.net/jnvms/incremental-itembased-collaborative-filtering-4095306
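Not speaking for the slides, but the general incremental idea can be sketched like this: cache the pairwise dot products and squared norms so that adding or removing a single tag only touches the item pairs sharing that tag dimension (a minimal in-memory sketch; in the app this would map onto the similarities table):
from collections import defaultdict
from math import sqrt

counts = defaultdict(lambda: defaultdict(int))   # counts[item][tag] = tag frequency
dots = defaultdict(lambda: defaultdict(float))   # dots[i][j] = dot product of i and j
norms = defaultdict(float)                       # norms[i] = squared norm of i

def update_tag(item, tag, delta):
    # delta is +1 when a tag is added, -1 when it is removed
    old = counts[item][tag]
    new = old + delta
    counts[item][tag] = new
    norms[item] += new * new - old * old
    for other, other_counts in counts.items():
        if other == item or other_counts.get(tag, 0) == 0:
            continue
        change = (new - old) * other_counts[tag]
        dots[item][other] += change
        dots[other][item] += change

def similarity(i, j):
    if norms[i] == 0 or norms[j] == 0:
        return 0.0
    return dots[i][j] / (sqrt(norms[i]) * sqrt(norms[j]))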