TF-IDF and Rocchio classification in Introduction to Information Retrieval

I'm looking at Table 14.1 from the Vector Space Classification chapter in Introduction to Information Retrieval. Example 14.1 says it "shows the tf-idf vector representations of the five documents in Table 13.1, using the formula (1 + log tf) * log(4/df) if tf > 0." Yet when I look at Table 14.1, it does not appear that this tf-idf formula has been applied to the document vectors.
The documents from Table 13.1 are:
1: Chinese Beijing Chinese
2: Chinese Chinese Shanghai
3: Chinese Macao
4: Tokyo Japan Chinese
and the term weights for the vectors in Table 14.1 are:
vector  Chinese  Japan  Tokyo  Macao  Beijing  Shanghai
d1      0        0      0      0      1.0      0
d2      0        0      0      0      0        1.0
d3      0        0      0      1.0    0        0
d4      0        0.71   0.71   0      0        0
If I apply the TF-IDF formula to the Japan dimension of d4, I get:
TF: 1 (the term appears once in document 4), so the TF part is 1 + log(1) = 1
DF: 1 (the term is present in only document 4), so the IDF part is log(4/1)
The TF-IDF weight is thus log(4) ≈ 0.60
Why does my calculated value differ from what is included in the text?

You have computed tf-idf correctly. The text is a bit misleading when it says
Table 14.1 shows the tf-idf vector representations of the five documents
in Table 13.1.
It is actually showing the tf-idf vector representations normalized to unit length.
The details:
Document 4 has three words "Tokyo", "Japan" and "Chinese".
You correctly computed that the TF-IDF weights for both "Tokyo" and "Japan" should be log10(4) ≈ 0.60. "Chinese" occurs in all four documents, so the IDF part of its weight is log(4/4) = 0, and the weight for "Chinese" is therefore zero.
So the vector for document 4 is
Chinese  Japan  Tokyo  Macao  Beijing  Shanghai
0        0.60   0.60   0      0        0
But the length of this vector is sqrt(0.60^2 + 0.60^2) ≈ 0.85. To get a vector of unit length, all components are divided by 0.85, giving the vector in the text:
Chinese  Japan  Tokyo  Macao  Beijing  Shanghai
0        0.71   0.71   0      0        0
It may be worth noting that the reason that we use vectors of unit length is to adjust for documents of different lengths. Without this adjustment, long documents would generally match queries better than short documents.
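For anyone who wants to check the arithmetic, here is a minimal MATLAB sketch of the computation above (my own illustration, not from the book; logs are base 10, as in the example):
% Terms ordered as in Table 14.1: Chinese Japan Tokyo Macao Beijing Shanghai
tf = [1 1 1 0 0 0];      % term frequencies in d4 ("Tokyo Japan Chinese")
df = [4 1 1 1 1 1];      % document frequencies over the N = 4 documents
N  = 4;
% tf-idf with the (1 + log tf) * log(N/df) weighting, zero where tf = 0
w = (tf > 0) .* (1 + log10(max(tf, 1))) .* log10(N ./ df)   % -> [0 0.60 0.60 0 0 0]
w / norm(w)                                                 % -> [0 0.71 0.71 0 0 0]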


KDB+/Q: Custom min max scaler

I'm trying to implement a custom min-max scaler in kdb+/q. I have taken note of the implementation in the ml package; however, I'm looking to be able to scale data between a custom range, e.g. 0 and 255. What would be an efficient implementation of min-max scaling in kdb+/q?
Thanks
Looking at the GitHub link on the page you referenced, it looks like you may be able to define a function like so:
minmax255:{[sf;x]sf*(x-mnx)%max[x]-mnx:min x}[255]
Where sf is your scaling factor (here given by 255).
q)minmax255 til 10
0 28.33333 56.66667 85 113.3333 141.6667 170 198.3333 226.6667 255
If you don't like decimals you could round to the nearest whole number like:
q)minmax255round:{[sf;x]floor 0.5+sf*(x-mnx)%max[x]-mnx:min x}[255]
q)minmax255round til 10
0 28 57 85 113 142 170 198 227 255
(The logic here: with a number like 1.7, adding 0.5 and flooring gives 2, whereas with a number like 1.2, adding 0.5 and flooring gives 1.)
If you don't want to start at 0 you could use |, which takes the max of its left and right arguments:
q)minmax255roundlb:{[sf;lb;x]lb|floor sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmax255roundlb til 10
10 28 56 85 113 141 170 198 226 255
where I'm using lb to mean 'lower bound'.
If you want to apply this to a table you could use
q)show testtab:([]a:til 10;b:til 10)
a b
---
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
q)update minmax255 a from testtab
a b
----------
0 0
28.33333 1
56.66667 2
85 3
113.3333 4
141.6667 5
170 6
198.3333 7
226.6667 8
255 9
The following will work nicely
minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
As petty as it sounds, my strong recommendation is that you do not follow Shehir94's solution for a custom minimum value. Applying a maximum (|) to force the starting range messes with the original distribution. A custom min-max scaling should be a simple linear transformation on top of a standard 0-1 min-max transformation:
X' = a + bX
For example, to get a custom scaling of 10-255, that would be b = 245 and a = 10 applied to the 0-1 min-max transform. We would expect the new mean to follow this formula, and the standard deviation to change only by the multiplicative factor; applying a lower bound instead breaks this. For example:
q)dummyData:10000?100.0
q)stats:{`transform`minVal`maxVal`avgVal`stdDev!(x;min y;max y; avg y; dev y)}
q)minmax255roundlb:{[sf;lb;x]lb|sf*(x-mnx)%max[x]-mnx:min x}[255;10]
q)minmaxCustom:{[l;u;x]l + (u - l) * (x-mnx)%max[x]-mnx:min x}
q)res:stats'[`orig`lb`linear;(dummyData;minmax255roundlb dummyData;minmaxCustom[10;255;dummyData])]
q)res
transform minVal maxVal avgVal stdDev
-----------------------------------------------
orig 0.02741043 99.98293 50.21896 28.92852
lb 10 255 128.2518 73.45999
linear 10 255 133.024 70.9064
// The transformed average should roughly be
q)10 + ((255-10)%100)*49.97936
132.4494
// The transformed standard deviation should roughly be
q)2.45*28.92852
70.87487
To answer the comment, this could be applied over a large number of columns in a table in the following manner:
q)n:10000
q)tab:([]sym:n?`3;col1:n?100.0;col2:n?100.0;col3:n?1000.0)
q)multiColApply:{[tab;scaler;colList]flip ft,((),colList)!((),scaler each (ft:flip tab)[colList])}
q)multiColApply[tab;minmaxCustom[10;20];`col1`col2]
sym col1 col2 col3
------------------------------
cag 13.78461 10.60606 392.7524
goo 15.26201 16.76768 517.0911
eoh 14.05111 19.59596 515.9796
kbc 13.37695 19.49495 406.6642
mdc 10.65973 12.52525 178.0839
odn 16.24697 17.37374 301.7723
ioj 15.08372 15.05051 785.033
mbc 16.7268 20 534.7096
bhj 12.95134 18.38384 711.1716
gnf 19.36005 15.35354 411.597
gnd 13.21948 18.08081 493.1835
khi 12.11997 17.27273 578.5203

Changing the value of elements in a table, depending on a specific string for MATLAB

Suppose I have a MATLAB table of the following type:
Node_Number  Generation_Type  Total_power(MW)
1            Wind             600
1            Solar            452
1            Tidal            123
2            Wind             200
2            Tidal            159
What I want to do is to produce a table with exactly same dimensions, with the only difference being the value of the data of the Total_Power column that corresponds to the Wind generation type being multiplied with 0.5. Hence the result that I would get would be:
Node_Number  Generation_Type  Total_power(MW)
1            Wind             300
1            Solar            452
1            Tidal            123
2            Wind             100
2            Tidal            159
What I believe would do the trick is code that scans all the rows for the string 'Wind' and, after locating those rows, multiplies the third column of each by 0.5. A for loop seems like a viable solution, though I am not sure how to implement it. Any help would be greatly appreciated.
Just find the indices of the rows with the category Wind; you can then access them by calling T(index,:).
clc; clear;
T = readtable('data.txt');                        % table with the columns shown above
rows = find(ismember(T.Generation_Type,'Wind'));  % row indices where the type is Wind
T(rows,:).Total_power_MW_ = T(rows,:).Total_power_MW_*0.5
Output:
Node_Number Generation_Type Total_power_MW_
___________ _______________ _______________
1 'Wind' 300
1 'Solar' 452
1 'Tidal' 123
2 'Wind' 100
2 'Tidal' 159
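If you'd rather try this without reading from data.txt, here is a self-contained variant of the same approach (a sketch: I rebuild the table in code and use logical indexing via strcmp instead of find/ismember):
Node_Number = [1;1;1;2;2];
Generation_Type = {'Wind';'Solar';'Tidal';'Wind';'Tidal'};
Total_power_MW_ = [600;452;123;200;159];
T = table(Node_Number, Generation_Type, Total_power_MW_);

idx = strcmp(T.Generation_Type, 'Wind');          % logical index of Wind rows
T.Total_power_MW_(idx) = T.Total_power_MW_(idx) * 0.5
Either way, no for loop is needed; the whole scan-and-multiply is one vectorized assignment.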

Fitting a distribution for array with zeros

I have data from stimulating subjects at different intensities (say we have 54 different intensities); what follows is the proportion of subjects recognizing the respective stimulation:
x = [0 0 0 0.50 0 0 0 0 0 0 0.5 0 0 0 0 0 0 0 0.125000000000000 0 0.333333333333333 0 0 0.111111111111111 0 0.428571428571429 0 0.285714285714286 0.166666666666667 0 0.1 0 0.400000000000000 0.5 0.4 0.25 0.6 0.727272727272727 0.714285714285714 0.25 0.666666666666667 0.777777777777778 1 0.75 0 1 0.9375 1 1 1 1 1 0.92 0.92]
Say the first index is the weakest stimulation and the last the strongest; as you can see, the stronger the stimulation, the more likely the subject is to recognize it.
I now want to fit a distribution to these values, to get something called a psychophysical curve.
What I have tried is:
pd = fitdist(x,distribution);
but this throws an error, I assume because of the 0's in the x array. What could I do alternatively?
As suggested in
Fit a sigmoid to my data using MATLAB
"I think you can use "Curve fitting" App in Matlab. you can find it in APPS, in "Math, statistics and optimization" section."
What you have to do is define two vector of the same length:
one with the stimuli
one with the respone
After, looking at your file, you can try ,using the "Curve Fitting" app in matlab, to fit a sigmoid.
After pressing the generate code button, matlab will create a fuction that will give the same result.
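If you prefer to do it programmatically, a minimal sketch with fit/fittype from the Curve Fitting Toolbox could look like this (the logistic model and start points are my own illustrative choices, not from the original answer):
stim = (1:numel(x))';        % stimulus intensities, 1..54
resp = x(:);                 % recognition rates from the question
sigmoid = fittype('1/(1 + exp(-b*(t - c)))', ...
                  'independent', 't', 'coefficients', {'b','c'});
f = fit(stim, resp, sigmoid, 'StartPoint', [0.2, median(stim)]);
plot(f, stim, resp)          % fitted psychophysical curve over the data
Note that this is curve fitting of response versus intensity, which is what a psychophysical curve needs; fitdist addresses a different problem, fitting a probability distribution to a sample of values.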

Extract sub-matrix based on conditions of specific columns in matlab

I want to select a sub-matrix x7992 based on conditions on certain columns of the original data matrix. Specifically, the original matrix is 23166-by-9, and I am following the original GAUSS code
x7992 =selif(data,data[.,col_coh].==0 .and data[.,col_year].<=1992);
I rewrite this in matlab with
x7992 = data(data(:,col_coh)==0 & data(:,col_year)<=1992);
col_coh and col_year are predefined column numbers.
However, rather than giving me a sub-matrix, the above line of code only gives me a single column (23166-by-1), which is not what I want (and not the real result of this condition). How can I fix it? Thank you.
--- Update -----
The data matrix looks like this (I omit the other columns because only the first 3 are relevant to the selection); the first column is an id for individuals:
1 1979 0
1 1980 0
1 1981 1
1 1982 0
1 1983 1
2 1990 0
2 1991 0
2 1992 0
2 1993 1
3 1985 0
3 1986 0
3 1987 0
Based on the conditions, what I want is a submatrix of data which excludes rows with a value > 1992 in the second column as well as rows with a value of 1 in the third.
You only get a single column vector as output because indexing with just the logical condition vector performs linear indexing, returning the matching elements as one column.
To get the entire rows you need to add a colon as the second subscript.
I've split the example into two lines to make it more readable.
condIdx = data(:,col_coh)==0 & data(:,col_year)<=1992;
x7992 = data( condIdx, :);
If you want specific columns in your result matrix, just put the column number in a vector instead of the colon operator.
colsInResult = [1 2 3];
x7992 = data( condIdx, colsInResult);
Based on the example that you have given, the following will do this:
data(data(:,2)<=1992 & data(:,3)~=1,:)
which gives this output:
1 1979 0
1 1980 0
1 1982 0
2 1990 0
2 1991 0
2 1992 0
3 1985 0
3 1986 0
3 1987 0

Combine data matrices of different frequencies

In MATLAB, how could I combine two matrices of data measured at different frequencies such that the result is indexed at the higher frequency? Since the data measured at the lower frequency will have many unknown values in the result, I would like to replace them with the last known value. There is a lot of data, so a vectorized solution would be preferred. I've added some sample data below.
Given:
index1  data1   index2  data2
1       2.1     2       30.5
2       3.3     6       32.0
3       3.5     9       35.0
4       3.9     13      35.5
5       4.5     17      34.5
6       5.0     20      37.0
7       5.2     ...     ...
8       5.7
9       6.8
10      7.9
...     ...
Result:
index1  data1   data2
1       2.1     NaN
2       3.3     30.5
3       3.5     30.5
4       3.9     30.5
5       4.5     30.5
6       5.0     32.0
7       5.2     32.0
8       5.7     32.0
9       6.8     35.0
10      7.9     35.0
...     ...     ...
EDIT:
I think the following post is close to what I need, but I'm not sure how to adapt its solution to my problem.
http://www.mathworks.com/matlabcentral/newsreader/view_thread/260139
EDIT (Several Months Later):
I've recently come across this excellent little function that I think may be of use to anyone who lands on this post:
function yi = interpLast(x,y,xi)
%INTERPLAST Interpolate the input data to the last known value.
%   Note: the index data x should be in ASCENDING order.
inds = arrayfun(@findinds, xi);    % index of the last known x <= each query
yi = y(inds);
    function ind = findinds(val)
        ind = find(x<=val,1,'last');
        if isempty(ind)            % query precedes all known x: clamp to first
            ind = 1;
        end
    end
end
Credit goes here: http://www.mathworks.com/support/solutions/en/data/1-48KETY/index.html?product=SL&solution=1-48KETY
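For a quick check against the sample data above, usage would look like this (variable names are mine, not from the linked solution):
index2 = [2 6 9 13 17 20]';
data2  = [30.5 32.0 35.0 35.5 34.5 37.0]';
index1 = (1:10)';
data2i = interpLast(index2, data2, index1);
% interpLast clamps queries before the first known index to y(1),
% so overwrite those with NaN to match the desired result:
data2i(index1 < index2(1)) = NaN;
[index1 data2i]    % second column: NaN 30.5 30.5 30.5 30.5 32 32 32 35 35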
The problem is one of run-length decoding. See section 15.5.2 of "MATLAB array manipulation tips and tricks" (an eye-opening read for any MATLAB enthusiast).
Here's the method applied to your example (I'm using Octave, but the code is identical in MATLAB):
octave:33> a=[2,30.5;6,32;9,35;13,35.5;17,34.5;20,37]
a =
2.0000 30.5000
6.0000 32.0000
9.0000 35.0000
13.0000 35.5000
17.0000 34.5000
20.0000 37.0000
octave:34> i=a(:,1)-1
i =
1
5
8
12
16
19
octave:35> j=zeros(1,i(end))
j =
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
octave:36> j(i(1:end-1)+1)=1
j =
0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0
octave:37> j(1)=1
j =
1 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0
octave:38> val=[NaN;a(:,2)]
val =
NaN
30.500
32.000
35.000
35.500
34.500
37.000
octave:39> x=val(cumsum(j))
x =
NaN
30.500
30.500
30.500
30.500
32.000
32.000
32.000
35.000
35.000
35.000
35.000
35.500
35.500
35.500
35.500
34.500
34.500
34.500
Prepending NaN to val supplies the unknown values before the first boundary, so x already matches the desired result for indices 1 through 19. (j only runs up to a(end,1)-1 = 19, so the last boundary value 37.0 at index 20 is not produced; append val(end) if you need it.)
I recently had the same problem as you: I had data, measured by different systems, which had to be synchronized and processed.
My solution consisted of putting the measurement data and time information (frequency, time at start of measurements) in a class object. Then I implemented multiplication, addition, etc. methods for that class that automatically took care of all the necessary things:
upsampling the lower-frequency signal (with linear interpolation, interp1)
shifting one of the signals, so the data lines up in time
cutting off the non-overlapping data set at beginning and end (with two different systems you never start or stop measuring at the same time, so there is some excess data)
actually performing the multiplication
returning the result as a new class object
Besides that, there were other methods whose purpose you can guess: plot, lpf, mean, getTimeAtIndex, getIndexAtTime, ...
This allowed me to simply do
signalsLabview = importLabViewSignals(LabViewData);
signalsMatlab = importMatlabSignals(MatlabData, 100); %hz
hydrPower = signalsLabview.flow * signalsMatlab.pressure;
plot(hydrPower);
or things like that. If you have a lot of these signals to do math on, this really helps and results in clear code. Otherwise you end up with a lot of generic code just for the syncing, shifting, and trimming around each operation. It also makes quick checks easy.
If you have to do these things a lot, I think it's definitely worth investing some time to build a proper framework.
Unfortunately I don't think I can disclose this code (IP and such), but it wasn't rocket science.
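For anyone wanting to try the same approach, a minimal sketch of such a class might look like this (my own illustration of the idea described above, not the original code; it would live in a file named Signal.m):
classdef Signal
    properties
        t   % sample times
        v   % sample values
    end
    methods
        function obj = Signal(t, v)
            obj.t = t(:);
            obj.v = v(:);
        end
        function out = mtimes(a, b)
            % Overload * : resample both signals onto the overlapping
            % time range at the higher rate, then multiply pointwise.
            t0 = max(a.t(1), b.t(1));        % trim non-overlapping data
            t1 = min(a.t(end), b.t(end));
            dt = min(median(diff(a.t)), median(diff(b.t)));  % higher rate wins
            t  = (t0:dt:t1)';
            out = Signal(t, interp1(a.t, a.v, t) .* interp1(b.t, b.v, t));
        end
    end
end
With plus, plot, and the other methods overloaded the same way, expressions like signalsLabview.flow * signalsMatlab.pressure work directly, as in the answer.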