Fuzzy match for two variables in a dataset

How do I do a fuzzy match (approximately 75% match) between two variables in a Stata dataset?
In my example, I am producing Match_yes = 1 if the value in Brand_1 is present in Brand_2:
Brand_1      Brand_2    Match_yes
Samsung      Samsung    1
Microsoft    Apple      0
Apple        Sony       1
Panasonic    Motorola   0
Miumiu                  0
Mottorrola              1
LG                      0
How do I get the value Mottorrola under variable Brand_1 to produce a Match_yes = 1, as it is 80% similar to the value Motorola in variable Brand_2?

Using your toy example:
clear
input strL(Brand_1 Brand_2)
Samsung Samsung
Microsoft Apple
Apple Sony
Panasonic Motorola
Miumiu
Mottorrola
LG
end
Here is a 'hack' using the community-contributed command matchit to produce the desired output:
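local obs = _N
generate Cont = 0

forvalues i = 1 / `obs' {
    forvalues j = 1 / `obs' {
        // exact match between observation i of Brand_1 and observation j of Brand_2
        replace Cont = 1 if Brand_1[`i'] == Brand_2[`j'] in `i'

        // copy the pair into observation 1 and compute its similarity score
        generate b1 = Brand_1[`i'] in 1
        generate b2 = Brand_2[`j'] in 1
        matchit b1 b2, generate(simscore)
        // the underscore keeps the names unique (i = 1, j = 11 versus i = 11, j = 1)
        generate score`i'_`j' = simscore

        // fuzzy match: flag observation i if the score exceeds the threshold
        replace Cont = 1 if score`i'_`j'[1] > 0.80 in `i'
        drop b1 b2 simscore
    }
}

drop score*
list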
     +------------------------------+
     |    Brand_1    Brand_2   Cont |
     |------------------------------|
  1. |    Samsung    Samsung      1 |
  2. |  Microsoft      Apple      0 |
  3. |      Apple       Sony      1 |
  4. |  Panasonic   Motorola      0 |
  5. |     Miumiu                 0 |
     |------------------------------|
  6. | Mottorrola                 1 |
  7. |         LG                 0 |
     +------------------------------+
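For readers outside Stata, the same "flag a value if anything in the other column is similar enough" logic can be sketched in Python. This is only an illustration under stated assumptions: it uses the standard library's difflib.SequenceMatcher ratio, which is a different similarity measure from the one matchit computes, and the 0.75 cut-off and the helper names (similarity, has_fuzzy_match) are hypothetical choices based on the "approximately 75%" in the question.

```python
from difflib import SequenceMatcher

# Toy data from the question; Brand_2 holds only its non-missing values.
brand_1 = ["Samsung", "Microsoft", "Apple", "Panasonic", "Miumiu", "Mottorrola", "LG"]
brand_2 = ["Samsung", "Apple", "Sony", "Motorola"]

THRESHOLD = 0.75  # assumed cut-off ("approximately 75%" in the question)


def similarity(a, b):
    """Return difflib's 0..1 similarity ratio (not matchit's score)."""
    return SequenceMatcher(None, a, b).ratio()


def has_fuzzy_match(value, candidates, threshold):
    """Return 1 if any candidate is at least `threshold` similar to `value`."""
    return int(any(similarity(value, c) >= threshold for c in candidates))


match_yes = [has_fuzzy_match(b, brand_2, THRESHOLD) for b in brand_1]
print(match_yes)  # [1, 0, 1, 0, 0, 1, 0]
```

With this measure, Mottorrola against Motorola scores 16/18 (about 0.89), so it clears both a 0.75 and a 0.80 threshold; other measures may rank pairs differently.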


two-variable output truth table for two-variable input

I have a 2-bit variable that needs to be converted into another one. I made the following table:
i1 i2 | o1 o2
0 0 | x x
0 1 | 0 1
1 0 | 1 0
1 1 | 0 1
But I cannot figure out how to do it other than with something like
((o1(i1,i2) & 0b01) << 1) | (o2(i1,i2) & 0b01)
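One way to read the table is as a small lookup, or equivalently as o1 = i1 AND NOT i2 and o2 = i2 on the three defined rows. The sketch below is in Python and is only an illustration: the function names convert_lookup and convert_logic are hypothetical, and the "x x" row for input 00 is treated as undefined because the question leaves it as a don't-care.

```python
# Defined rows of the truth table, keyed by (i1, i2) and giving (o1, o2).
# Input (0, 0) is a don't-care in the question, so it is deliberately absent.
TABLE = {
    (0, 1): (0, 1),
    (1, 0): (1, 0),
    (1, 1): (0, 1),
}


def convert_lookup(bits):
    """Map a 2-bit input to a 2-bit output via the table (KeyError for 0b00)."""
    i1, i2 = (bits >> 1) & 1, bits & 1
    o1, o2 = TABLE[(i1, i2)]
    return (o1 << 1) | o2


def convert_logic(bits):
    """Same mapping as boolean logic: o1 = i1 AND NOT i2, o2 = i2."""
    i1, i2 = (bits >> 1) & 1, bits & 1
    o1 = i1 & (i2 ^ 1)
    o2 = i2
    return (o1 << 1) | o2


for value in (0b01, 0b10, 0b11):
    print(format(value, "02b"), "->", format(convert_lookup(value), "02b"))
```

The logic version happens to return 00 for input 00, which is one acceptable choice for a don't-care row.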

Stata merge with multiple match variables

I am having difficulty combining datasets for a project. Our primary dataset is organized by individual judges. It is an attribute dataset.
judge
j | x | y | z
----|----|----|----
1 | 2 | 3 | 4
2 | 5 | 6 | 7
The second dataset is a case database. Each observation is a case and judges can appear in one of three variables.
case
case | j1 | j2 | j3 | year
-----|----|----|----|-----
1 | 1 | 2 | 3 | 2002
2 | 2 | 3 | 1 | 1997
We would like to merge the case database into the attribute database, matching by judge. So, for each case that a judge appears in j1, j2, or j3, an observation for that case would be added creating a dataset that looks like below.
combined
j | x | y | z | case | year
---|----|----|----|-------|--------
1 | 2 | 3 | 4 | 1 | 2002
1 | 2 | 3 | 4 | 2 | 1997
2 | 5 | 6 | 7 | 1 | 2002
2 | 5 | 6 | 7 | 2 | 1997
My best guess is to use
rename j1 j
merge 1:m j using case
rename j j1
rename j2 j
merge 1:m j using case
However, I am unsure that this will work, especially since the merging dataset has three possible variables that the j identification can occur in.
Your examples are clear, but it would be even better to present them as code that does not require editing to remove the scaffolding. See dataex from SSC (ssc inst dataex).
It's a case of the missing reshape, I think.
clear
input j x y z
1 2 3 4
2 5 6 7
end
save judge
clear
input case j1 j2 j3 year
1 1 2 3 2002
2 2 3 1 1997
end
reshape long j , i(case) j(which)
merge m:1 j using judge
list
     +-------------------------------------------------------+
     | case   which   j   year   x   y   z            _merge |
     |-------------------------------------------------------|
  1. |    1       1   1   2002   2   3   4       matched (3) |
  2. |    2       3   1   1997   2   3   4       matched (3) |
  3. |    2       1   2   1997   5   6   7       matched (3) |
  4. |    1       2   2   2002   5   6   7       matched (3) |
  5. |    2       2   3   1997   .   .   .   master only (1) |
     |-------------------------------------------------------|
  6. |    1       3   3   2002   .   .   .   master only (1) |
     +-------------------------------------------------------+
drop if _merge < 3
list
     +---------------------------------------------------+
     | case   which   j   year   x   y   z        _merge |
     |---------------------------------------------------|
  1. |    1       1   1   2002   2   3   4   matched (3) |
  2. |    2       3   1   1997   2   3   4   matched (3) |
  3. |    2       1   2   1997   5   6   7   matched (3) |
  4. |    1       2   2   2002   5   6   7   matched (3) |
     +---------------------------------------------------+
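The reshape-then-merge idea is independent of Stata, so here is a rough sketch of the same two steps in plain Python, using the toy data from the question. It is an illustration only: the variable names (judges, cases, long_rows, combined) are hypothetical, and keeping only matched rows mirrors the drop if _merge < 3 step.

```python
# Judge attributes keyed by judge id j (the "judge" dataset).
judges = {
    1: {"x": 2, "y": 3, "z": 4},
    2: {"x": 5, "y": 6, "z": 7},
}

# Case data in wide form: one row per case, judges in j1..j3.
cases = [
    {"case": 1, "j1": 1, "j2": 2, "j3": 3, "year": 2002},
    {"case": 2, "j1": 2, "j2": 3, "j3": 1, "year": 1997},
]

# Step 1, "reshape long": one row per (case, judge slot).
long_rows = [
    {"case": c["case"], "which": k, "j": c[f"j{k}"], "year": c["year"]}
    for c in cases
    for k in (1, 2, 3)
]

# Step 2, many-to-one merge on j, keeping matched rows only.
combined = [
    {**row, **judges[row["j"]]}
    for row in long_rows
    if row["j"] in judges
]
combined.sort(key=lambda r: (r["j"], r["case"]))

for r in combined:
    print(r["j"], r["x"], r["y"], r["z"], r["case"], r["year"])
```

Judge 3 appears in the case data but has no attribute row, so those two long rows are dropped, leaving the four rows of the desired combined table.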

YUI 3 Chart Axis Label Positioning?

I have created a chart using YUI and want to display axis labels. The x axis is fine, but the y axis label appears inside of the data.
Here is what is happening:
10 |
9 |
8 |
7 l |
6 a |
5 b | CHART
4 e |
3 l |
2 |
1 |
0 |_____________________________
Here is what I want to happen:
10 |
9 |
8 |
l 7 |
a 6 |
b 5 | CHART
e 4 |
l 3 |
2 |
1 |
0 |_____________________________
Here is my code for the chart axes:
var chartaxes = {
    timeelapsed:{
        position:"bottom",
        type:"category",
        title:"label"
    },
    kWh:{
        position:"left",
        type:"numeric",
        title:"label",
    }
};
Is there any way to fix this?
To display the title correctly, each axis needs to be associated with its series through the keys attribute:
var chartaxes = {
    timeelapsed:{
        position:"bottom",
        type:"category",
        title:"Time Elapsed (minutes)",
        keys: ["category"],
and so on.

Difference between correctly / incorrectly classified instances in decision tree and confusion matrix in Weka

I have been using Weka’s J48 decision tree to classify frequencies of keywords
in RSS feeds into target categories. And I think I may have a problem
reconciling the generated decision tree with the number of correctly classified
instances reported and in the confusion matrix.
For example, one of my .arff files contains the following data extracts:
@attribute Keyword_1_nasa_Frequency numeric
@attribute Keyword_2_fish_Frequency numeric
@attribute Keyword_3_kill_Frequency numeric
@attribute Keyword_4_show_Frequency numeric
...
@attribute Keyword_64_fear_Frequency numeric
@attribute RSSFeedCategoryDescription {BFE,FCL,F,M, NCA, SNT,S}
@data
0,0,0,34,0,0,0,0,0,40,0,0,0,0,0,0,0,0,0,0,24,0,0,0,0,13,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,10,0,0,0,0,0,11,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,BFE
...
20,0,64,19,0,162,0,0,36,72,179,24,24,47,24,40,0,48,0,0,0,97,24,0,48,205,143,62,78,
0,0,216,0,36,24,24,0,0,24,0,0,0,0,140,24,0,0,0,0,72,176,0,0,144,48,0,38,0,284,
221,72,0,72,0,SNT
...
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8,0,0,0,0,0,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,S
And so on: there is a total of 64 keywords (columns) and 570 rows, where each row contains the frequency of each keyword in one feed for one day. In this case, there are 57 feeds over 10 days, giving a total of 570 records to be classified. Each keyword is prefixed with a surrogate number and suffixed with 'Frequency'.
I ran the decision tree with default parameters and 10-fold cross-validation.
Weka reports the following:
Correctly Classified Instances 210 36.8421 %
Incorrectly Classified Instances 360 63.1579 %
With the following confusion matrix:
=== Confusion Matrix ===
a b c d e f g <-- classified as
11 0 0 0 39 0 0 | a = BFE
0 0 0 0 60 0 0 | b = FCL
1 0 5 0 72 0 2 | c = F
0 0 1 0 69 0 0 | d = M
3 0 0 0 153 0 4 | e = NCA
0 0 0 0 90 10 0 | f = SNT
0 0 0 0 19 0 31 | g = S
The tree is as follows:
Keyword_22_health_Frequency <= 0
| Keyword_7_open_Frequency <= 0
| | Keyword_52_libya_Frequency <= 0
| | | Keyword_21_job_Frequency <= 0
| | | | Keyword_48_pic_Frequency <= 0
| | | | | Keyword_63_world_Frequency <= 0
| | | | | | Keyword_26_day_Frequency <= 0: NCA (461.0/343.0)
| | | | | | Keyword_26_day_Frequency > 0: BFE (8.0/3.0)
| | | | | Keyword_63_world_Frequency > 0
| | | | | | Keyword_31_gaddafi_Frequency <= 0: S (4.0/1.0)
| | | | | | Keyword_31_gaddafi_Frequency > 0: NCA (3.0)
| | | | Keyword_48_pic_Frequency > 0: F (7.0)
| | | Keyword_21_job_Frequency > 0: BFE (10.0/1.0)
| | Keyword_52_libya_Frequency > 0: NCA (31.0)
| Keyword_7_open_Frequency > 0
| | Keyword_31_gaddafi_Frequency <= 0: S (32.0/1.0)
| | Keyword_31_gaddafi_Frequency > 0: NCA (4.0)
Keyword_22_health_Frequency > 0: SNT (10.0)
My question concerns reconciling the matrix to the tree or vice versa. As far as I understand the results, a rating like (461.0/343.0) indicates that 461 instances have been classified as NCA. But how can that be when the matrix reveals only 153? I am not sure how to interpret this, so any help is welcome.
Thanks in advance.
The numbers in parentheses at each leaf should be read as (total number of instances that reached this leaf / number of those instances that are classified incorrectly at this leaf).
In your example, the first NCA leaf says that 461 instances were classified as NCA there, and that 343 of those 461 were classified incorrectly. So there are 461 - 343 = 118 correctly classified instances at that leaf.
Looking through your decision tree, note that NCA is also at other leaves. I count 118 + 3 + 31 + 4 = 156 correctly classified instances out of 461 + 3 + 31 + 4 = 499 total classifications of NCA.
Your confusion matrix shows 153 correct classifications of NCA out of 39 + 60 + 72 + 69 + 153 + 90 + 19 = 502 total classifications of NCA.
So there is a slight difference between the tree (156/499) and your confusion matrix (153/502).
Note that if you are running Weka from the command line, it outputs a tree and a confusion matrix both for testing on all the training data and for testing with cross-validation, which gives two pairs of matrix and tree. Be careful that you are looking at the right matrix for the right tree; you may have mixed them up.
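The tallies above can be reproduced with a short script. This is a sketch in Python, using only the leaf counts and the confusion matrix quoted in the question; the variable names are hypothetical.

```python
# Leaves of the printed tree: (predicted class, instances at leaf, misclassified).
leaves = [
    ("NCA", 461, 343), ("BFE", 8, 3), ("S", 4, 1), ("NCA", 3, 0), ("F", 7, 0),
    ("BFE", 10, 1), ("NCA", 31, 0), ("S", 32, 1), ("NCA", 4, 0), ("SNT", 10, 0),
]

# Confusion matrix: rows are actual classes, columns are predicted classes.
labels = ["BFE", "FCL", "F", "M", "NCA", "SNT", "S"]
matrix = [
    [11, 0, 0, 0, 39, 0, 0],
    [0, 0, 0, 0, 60, 0, 0],
    [1, 0, 5, 0, 72, 0, 2],
    [0, 0, 1, 0, 69, 0, 0],
    [3, 0, 0, 0, 153, 0, 4],
    [0, 0, 0, 0, 90, 10, 0],
    [0, 0, 0, 0, 19, 0, 31],
]

# NCA according to the tree leaves.
tree_total = sum(n for cls, n, _ in leaves if cls == "NCA")
tree_correct = sum(n - wrong for cls, n, wrong in leaves if cls == "NCA")

# NCA according to the confusion matrix (column = predicted as NCA).
col = labels.index("NCA")
matrix_total = sum(row[col] for row in matrix)
matrix_correct = matrix[col][col]

# Sanity check: how many instances the leaves account for in total.
tree_instances = sum(n for _, n, _ in leaves)

print(tree_correct, tree_total)      # 156 499
print(matrix_correct, matrix_total)  # 153 502
print(tree_instances)                # 570
```

The leaf totals add up to 570, the full size of the dataset described in the question.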

MATLAB: Identify if a value is repeated sequentially N times in a vector

I am trying to identify if a value is repeated sequentially in a vector N times. The challenge I am facing is that it could be repeated sequentially N times several times within the vector. The purpose is to determine how many times in a row certain values fall above the mean value. For example:
>> return_deltas
return_deltas =
7.49828129642663
11.5098198572327
15.1776644881294
11.256677995536
6.22315734182976
8.75582103474613
21.0488849115947
26.132605745393
27.0507649089989
...
(I only printed a few values for example but the vector is large.)
>> mean(return_deltas)
ans =
10.50007490258002
>> sum(return_deltas > mean(return_deltas))
ans =
50
So there are 50 instances of a value in return_deltas being greater than the mean of return_deltas.
I need to identify the number of times, sequentially, the value in return_deltas is greater than its mean 3 times in a row. In other words, if the values in return_deltas are greater than its mean 3 times in a row, that is one instance.
For example:
---------------------------------------------------------------------
| `return_delta` value | mean | greater or less | sequence |
|--------------------------------------------------------------------
| 7.49828129642663 |10.500074902 | LT | 1 |
| 11.5098198572327 |10.500074902 | GT | 1 |
| 15.1776644881294 |10.500074902 | GT | 2 |
| 11.256677995536 |10.500074902 | GT | 3 * |
| 6.22315734182976 |10.500074902 | LT | 1 |
| 8.75582103474613 |10.500074902 | LT | 2 |
| 21.0488849115947 |10.500074902 | GT | 1 |
| 26.132605745393 |10.500074902 | GT | 2 |
| 27.0507649089989 |10.500074902 | GT | 3 * |
---------------------------------------------------------------------
The star represents a successful sequence of 3 in a row. The result of this set would be two because there were two occasions where the value was greater than the mean 3 times in a row.
What I am thinking is to create a new vector:
>> a = return_deltas > mean(return_deltas)
which of course contains ones where the values in return_deltas are greater than the mean, and then use it to find how many times the value in return_deltas is greater than its mean 3 times in a row. I am attempting to do this with a built-in function (if there is one, I have not discovered it), or at least without loops.
Any thoughts on how I might approach this?
With a little work, this snippet finds the starting index of every run of numbers:
[0 find(diff(v) ~= 0)] + 1
An Example:
>> v = [3 3 3 4 4 4 1 2 9 9 9 9 9]; % vector of integers
>> run_starts = [0 find(diff(v) ~= 0)] + 1 % for floating-point data, abs(diff(v)) > EPSILON may be better
run_starts =
1 4 7 8 9
To find the length of each run:
>> run_lengths = [diff(run_starts), length(v) - run_starts(end) + 1]
This variable then makes it easy to find which runs have at least a certain length:
>> find(run_lengths >= 4)
ans =
5
>> find(run_lengths >= 2)
ans =
1 2 5
This tells us that the only run of at least four integers in a row was run #5.
However, there were three runs that were at least two integers in a row, specifically runs #1, #2, and #5.
You can reference where each run starts from the run_starts variable.
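To connect the run-length idea back to the question, the same approach can be applied to the logical vector return_deltas > mean rather than to the raw values. The sketch below is in Python rather than MATLAB and is only an illustration: it uses the nine values printed in the question together with the mean reported for the full vector, the helper name count_runs is hypothetical, and it assumes a longer run (say six in a row) counts as one instance, which the question does not settle.

```python
from itertools import groupby

# The nine values shown in the question (the real vector is longer).
return_deltas = [
    7.49828129642663, 11.5098198572327, 15.1776644881294,
    11.256677995536, 6.22315734182976, 8.75582103474613,
    21.0488849115947, 26.132605745393, 27.0507649089989,
]
mean_value = 10.50007490258002  # mean of the full vector, as reported in the question


def count_runs(flags, n):
    """Count maximal runs of True values that are at least n long."""
    return sum(
        1
        for is_above, run in groupby(flags)
        if is_above and len(list(run)) >= n
    )


above = [x > mean_value for x in return_deltas]
print(count_runs(above, 3))  # 2
```

If every block of three inside a longer run should count separately, summing len(run) // n over the True runs would be one alternative.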