Classify many documents with many different unordered items

I have the following documents with different unordered items, e.g.,
doc_1:
item 1,
item 2,
...
item n
doc_2:
item 7,
item 3,
...
item n
.
.
.
doc_n:
item 20,
item 17,
...
item n
How can I classify these into similar groups? E.g.,
1- doc_1:
all item 1
2- doc_2:
all item 2
...etc.

From your comment, "How to collect similar items in different documents?", I am assuming you want to compute a similarity score between the lines in one document and the lines in another document.
One simple approach is to represent each line/sentence of a document with a bag-of-words model. Then you can compute the cosine similarity of two sentence/line representations.
If two lines from two different documents are textually very close, you should observe a higher cosine similarity between their sentence representations.
Please note that a useful representation and similarity computation depend on the problem you are trying to solve. For example, if different documents contain product reviews (one or more lines each) associated with different products (cell phones, laptops) and you want to collect the lines associated with a single product, you can simply follow the approach mentioned above.
Also note that the solution I have suggested is pretty naive. To achieve higher accuracy in your desired task, you may need to design a more focused representation that will be effective for the target task.
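A minimal sketch of this bag-of-words plus cosine similarity approach, assuming scikit-learn is available (the review lines below are made up):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    doc1_lines = ["great battery life on this phone",
                  "the laptop keyboard feels cheap"]
    doc2_lines = ["phone battery lasts all day",
                  "laptop screen resolution is excellent"]

    # One shared vocabulary for all lines, then one count vector per line
    vectorizer = CountVectorizer()
    vectors = vectorizer.fit_transform(doc1_lines + doc2_lines)

    # sims[i][j] = cosine similarity of doc1 line i and doc2 line j
    sims = cosine_similarity(vectors[:len(doc1_lines)], vectors[len(doc1_lines):])
    print(sims.round(2))

Lines that share more words (e.g., the two battery sentences) will come out with a noticeably higher score than unrelated pairs.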

Most appropriate analysis method - Clustering?

I have 2 large data frames with similar variables representing 2 separate surveys. Some rows (participants) in each data frame correspond to rows in the other, and I would like to link the two together.
There is an index in both data frames, though this index indicates the locality of the survey (i.e., region) and not individual IDs.
Merging is not possible, as in most cases there are identical index values for different participants.
Given that merging on an index value from the 2 data frames is not possible, I wish to compare similar (binary) variables from both data frames, in addition to the index values common to both, in order to give me the highest likelihood of a match. I can then (with some margin of error) match rows with the most similar values for similar variables and merge them together.
What do you think would be the appropriate method for doing this? Clustering?
Best,
James
That obviously is not clustering. You don't want large groups of records.
What you want to do is an approximate JOIN.
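A minimal sketch of such an approximate join in pandas (the data frame and column names are made up): within each region, each row of one survey is matched to the row of the other survey that agrees on the most binary variables.

    import pandas as pd

    # Hypothetical data: a shared region index plus shared binary variables
    cols = ["smoker", "employed", "urban"]
    df_a = pd.DataFrame({"region": [1, 1, 2],
                         "smoker": [1, 0, 1],
                         "employed": [0, 1, 1],
                         "urban": [1, 1, 0]})
    df_b = pd.DataFrame({"region": [1, 1, 2],
                         "smoker": [1, 0, 0],
                         "employed": [0, 1, 1],
                         "urban": [1, 0, 0]})

    # For each region, pair every row of df_a with the df_b row
    # that agrees on the most binary variables (an approximate join)
    for region, a_rows in df_a.groupby("region"):
        b_rows = df_b[df_b["region"] == region]
        for i, a in a_rows[cols].iterrows():
            agreements = (b_rows[cols] == a).sum(axis=1)
            best = agreements.idxmax()
            print(f"df_a row {i} -> df_b row {best} ({agreements[best]}/{len(cols)} agree)")

In practice you would also want a threshold below which no match is declared, to control the margin of error mentioned in the question.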

Tableau Dual Axis with different filters

I am trying to create a graph with two lines, with two filters from the same dimension.
I have a dimension which has 20+ values. I'd like one line to show data based on just one selected value and the other line to show data excluding that same value.
I've tried the following:
- Creating a duplicate/copy of the dimension and filtering the original with the first value and the copy with the second. When I do this, the graph disappears.
- Creating a calculated field that tries to split the measures up. This isn't letting me track the count.
I want this on the same axis; the best I've been able to do is create two sheets, one with the first filter and one with the 2nd, and stack them in a dashboard.
My end user wants the lines in the same visual; otherwise I'd be happy with the dashboard approach. Either way, I'd like to know how to do this.
It is a little hard to tell exactly what you want to achieve, but the problem with filtering is common.
The principle that is important is that Tableau will filter the whole dataset by row. So duplicating the dimension you want to filter won't help as the filter on the original dimension will also filter the corresponding rows in the second dimension. Any solution has to be clever enough to work around this issue.
One solution is to build a new field that uses a calculation rather than a filter to create the result. Let's say you have a dimension, [size], with a range of numbers from 1 to 10, and you want to compare the total number of rows including and excluding the number 5. You could create a new field using a formula like: if [size] <> 5 then 1 else 0 end
Summing the new field will give a count of the rows that don't contain a 5, and this can be compared directly to a row count of the original [size] field, which will give the number including the value 5.
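Not Tableau syntax, but the counting logic can be made concrete with a quick sketch in pandas (illustrative data):

    import pandas as pd

    df = pd.DataFrame({"size": [1, 5, 2, 5, 7, 5, 3]})
    # The calculated field: 1 where the row is not a 5, else 0
    df["not_five"] = (df["size"] != 5).astype(int)
    # Count excluding 5 versus the total row count, side by side
    print(df["not_five"].sum(), len(df))  # prints: 4 7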
This basic principle can be extended to much more complex logic. The essential point is to realise that filters act on every row in your data and can't, by themselves, show comparisons with alternative filter choices on a single visualisation.
Depending on the nature of your problem there may be other solutions worth looking at including sets and groups but you would need to provide more specific details for users here to tell you whether they would be useful.
We can also make a set out of the values of the dimension and then place it on the required shelf. You will then have your dimension, which will plot accordingly, and the set, which will show data as per the requirement, because with a filter you don't have that independence to show the data you want every time.

Function to check similarity of a string with a list of strings

I have a collection of similar strings in a bucket and have multiple such buckets. What kind of function should I use on the strings to compare a random string with the buckets to find out which bucket it belongs to?
To clarify, each entity in the bucket is a sentence that can have multiple words.
An example:
Consider the list of strings in the bucket:
1. round neck black t-shirt
2. printed tee shirt
3. brown polo t-shirt
If we have "blue high neck t-shirt" as input, we want to check if it can be added to the same bucket. This is a simple example, but consider doing this for buckets of, say, hundreds of strings.
Any reference to an article or paper will be really helpful.
First of all, I am thinking about two kinds of similarities: syntactic and semantic.
1) Syntactic
Levenshtein distance can be used to measure the distance between two sequences of characters.
Stemming can be used to increase the match probability. There are various implementations. You get the stem (or root) of each word in your random string and compare it with the stems from your buckets. Of course, bucket stems should be precomputed for efficiency.
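For example, with NLTK's PorterStemmer (just one of the many implementations):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    # The random string's words and the bucket words reduce to the
    # same stems, which increases the chance of a match
    print(stemmer.stem("printed"))  # -> print
    print(stemmer.stem("shirts"))   # -> shirt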
2) Semantic
For general info you can read the article from Wikipedia.
For an actual implementation you can read this article from CodeProject. It is based on WordNet, a great ontology for the English language that stores concepts in synsets and also provides various relations between these synsets.
To get more details you should tell us what kind of similarity you need.
[edit]
Based on provided information, I think you can do something like this:
1) split all strings into words => the random string will be named Array1 and the current bucket Array2
2) compute similarity as number_of_common_words(Array1, Array2) / count(Array2)
3) choose maximum similarity
Specificity may also be increased by adding points for position matches: Array1[i] == Array2[i]
For better performance I would store buckets as hash tables, dictionaries, etc., so that the existence check is done in O(1) (see the sketch below).
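A small sketch of steps 1-3 in Python, with each bucket stored as a set of words so the membership check is O(1) (the buckets below are illustrative):

    def bucket_similarity(query, bucket_words):
        # number_of_common_words(Array1, Array2) / count(Array2)
        query_words = set(query.lower().split())
        return len(query_words & bucket_words) / len(bucket_words)

    buckets = {
        "t-shirts": set("round neck black t-shirt printed tee shirt brown polo".split()),
        "jeans": set("slim fit blue jeans ripped denim".split()),
    }

    query = "blue high neck t-shirt"
    # Step 3: choose the bucket with the maximum similarity
    best = max(buckets, key=lambda name: bucket_similarity(query, buckets[name]))
    print(best)  # -> t-shirts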

Sphinx: Understanding Proximity Factor Ranking for a large field

This document explains the sphinx proximity factor ranking algorithm (see section: Phrase proximity factor).
Will the proximity factor ranker give a higher field weight if the keyword is found more than once in the field?
E.g., using similar logic to the referenced section of the linked document, for a single-instance match the phrase weight would be 2:
1) query = one two three, field = one and two three
field_phrase_weight = 2 (because 2-keyword long "two three" subphrase matched)
What about if the same phrase was matched twice? Would the weight be double?:
2) query = one two three, field = one and two three one and two three
field_phrase_weight = 4? (because 2-keyword long "two three" subphrase matched twice?)
I suspect that the answer to the above question is no: Sphinx will return the same field weight whether the keyword/keyword subsequence is found once or multiple times. If that is the case, how can one make good use of the proximity algorithm for a large Sphinx field, like an essay, if it returns the same field weight regardless of content size? Especially given that the go-to Sphinx ranker is the proximity_bm25 ranker, which relies very heavily on proximity ranking (for a multi-field document, at least 60% of the algorithm's weight would go towards proximity ranking over BM25?).
Will the proximity factor ranker give a higher field weight if the keyword is found more than once in the field?
No. The same field weight will apply.
In the second example (where the same subphrase is matched twice), the weight would not be doubled either.
how to make good use of the proximity algorithm for large sphinx fields like an essay if this will return the same field weight regardless of content size?
The only way I can figure is through a combination: use the proximity factor to give higher weights to multi-keyword phrases, but also give BM25 enough weight in the algorithm to add value by providing the "rare keywords occurring more often in documents" factor. The BM25 part of the proximity_bm25 ranking algorithm is designed for this purpose.
This is the proximity_bm25 expression: sum(lcs*user_weight)*1000+bm25. The bm25 component progressively becomes more irrelevant the more Sphinx fields you have in a document, as the sum(lcs*user_weight)*1000 part of the formula applies to each individual field while the bm25 part of the equation applies to the document as a whole.
In my situation with 10 Sphinx fields, the bm25 was accounting for just 5% of the total weight. I bumped up the weight of the bm25 portion of the formula to account for around 20% of the total weight, changing the formula as such:
sum(lcs*user_weight)*1000+bm25*4
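As a rough sanity check of those percentages (the per-field numbers below are hypothetical), a short Python sketch:

    # Hypothetical: 10 fields each contributing ~1900 via sum(lcs*user_weight)*1000,
    # plus one document-wide bm25 term of ~1000
    fields, per_field_proximity, bm25 = 10, 1900.0, 1000.0
    proximity_total = fields * per_field_proximity
    print(bm25 / (proximity_total + bm25))          # ~0.05 -> bm25 is about 5%
    print(4 * bm25 / (proximity_total + 4 * bm25))  # ~0.17 -> bm25*4 is about 20%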

Mahout Log Likelihood similarity metric behaviour

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic and filtration level for my data. (I'm using 'filtration level' to mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of triplets where an item's ratings are contained in the set {1,2,3,4,5}. I'm using an itemBased recommender atop a logLikelihood similarity metric. I filter out users who rate fewer than 20 items from the production dataset. RMSE looks good (1.17ish) and there is no data capping going on, but there is an odd behavior that is undesirable and borders on error-like.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a Centered Sum:
center = (5 + 1) / 2.0  # midpoint, if you allow ratings on a scale of 1 to 5 for example
centered_sum = {i: 0.0 for i in items}  # per-item sum used to rank the top-items list
for i in items:
    for r in ratings[i]:  # every rating item i has received
        centered_sum[i] += r - center
I use a centered sum instead of average ratings to generate a top items list mainly because I want the number of ratings that an item has received to factor into the ranking.
Second Call -- I ask for 9 similar items to each of the top items returned in the first call. For each top item I ask for, 7 out of 9 of the similar items returned are the same as the similar items returned for the other top items!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-rated items)/x, where x is tuned (around 50 or something to begin with).
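Here is that rescoring sketched in Python (x is the tuning constant; I've capped the multiplier at 1, and the numbers are illustrative):

    def rescore(similarity, num_co_rated, x=50.0):
        # Damp similarities that are supported by few co-rated items
        return similarity * min(1.0, num_co_rated / x)

    print(rescore(0.9, 5))   # 0.9 * (5/50) = 0.09
    print(rescore(0.9, 80))  # capped: stays 0.9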
Thanks in advance fellas
You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50. And most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? ranking by sum rather than average still gives you a relatively similar output if the number of items in the sum for each calculation is roughly similar.
What problem are you trying to solve? None of this seems to have a bearing on the recommender system you describe, which you say you're using and which works. Log-likelihood similarity is not even based on ratings.