TfidfVectorizer processes first document only

I'm getting TF-IDF weights only for the 1st document in the list. The rest are all zeros!
tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(docs)
print(tfidf_vectorizer.idf_)
pd.DataFrame(X[0].T.todense(),
             index=tfidf_vectorizer.get_feature_names(),
             columns=["tfidf"]).sort_values(by=["tfidf"], ascending=False).head(10)
For the list:
docs=[
"the world dog",
"the cat hello",
"the foo hello",
]
The output:
[1.69314718 1.69314718 1.69314718 1.28768207 1. 1.69314718]
tfidf
dog 0.652491
world 0.652491
the 0.385372
cat 0.000000
foo 0.000000
hello 0.000000
After swapping the first two documents:
docs=[
"the cat hello",
"the world dog",
"the foo hello",
]
The output becomes:
[1.69314718 1.69314718 1.69314718 1.28768207 1. 1.69314718]
tfidf
cat 0.720333
hello 0.547832
the 0.425441
dog 0.000000
foo 0.000000
world 0.000000
Can anyone offer insight into this issue?

It might be because, when you build the DataFrame, you only use the first row of the TF-IDF sparse matrix: X[0].T.todense() should be X.T.todense().
Also, you'd do better to keep the fitted vectorizer in its own variable, so you can transform new incoming data for testing:
tfidf_vectorizer = TfidfVectorizer().fit(docs)
X = tfidf_vectorizer.transform(docs)
print(tfidf_vectorizer.idf_)
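For example, the same fitted vectorizer can then score unseen text (a hypothetical new document, just for illustration; words it has never seen are simply ignored):
new_X = tfidf_vectorizer.transform(["the cat and the dog"])  # "and" is out of vocabulary and dropped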
Now, since you no longer have a single row, you can't transpose and use a single column; instead, convert the whole matrix to an array and use the feature names as the columns. The index will be numeric, one row per document.
df = pd.DataFrame(X.toarray(),
                  columns=tfidf_vectorizer.get_feature_names())
print(df.head(10)) # or display() if you're in jupyter
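If you still want the original per-document view, take a single row of that DataFrame and sort it, e.g. for the first document:
df.iloc[0].sort_values(ascending=False).head(10)
Note that on scikit-learn >= 1.0, get_feature_names() is deprecated in favour of get_feature_names_out().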


How can I get the numbers for the correlation matrix from Pandas Profiling

I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?
It was a bit hard to track down, but starting from the documentation, specifically the report structure, and digging into the function get_correlation_items(summary), then looking at its usage in the source, we get to the call that essentially loops over each of the correlation types in the summary. To obtain the summary object, we look up the caller and find that it is get_report_structure(summary); and if we try to find where the summary argument comes from, it turns out to be simply the description_set property, as shown here.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
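Each entry in correlations is a plain pandas DataFrame indexed by column name, so you can, for example, dump a full matrix to disk (illustrative file name):
correlations["pearson"].round(3).to_csv("pearson_correlations.csv")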
Sample Notebook

K-means clustering error: only 0's may be mixed with negative subscripts

I am trying to do kmeans clustering on IRIS data in R. I want to use KKZ option for the seed selection (starting points of clusters).
If I don't standardize the data, I have no issues with the kkz command:
library(inaparc)
res<- kkz(x=iris[,1:4], k=3)
seed <- res$v # this gives me the cluster seeds based on KKZ method
k1 <- kmeans(iris[,1:4], seed, iter.max=1000)
However, when I scale the data first, the kkz command gives me this error:
library(ClusterR)
dat <- center_scale(iris[1:4], mean_center = TRUE, sd_scale = TRUE) # scale iris data
res2 <- kkz(x=dat, k=3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
I think this is an array-indexing issue, but I'm not sure what it is or how to solve it.
For some reason, kkz cannot take anything with a mixture of positive and negative values; judging from the error message, it subscripts the data matrix with its own values, which breaks as soon as some of them are negative. I have a lot of problems running it, for example:
#ok
set.seed(1000)
kkz(matrix(rnorm(1000,5,1),100,10),3)
# not ok
kkz(matrix(rnorm(1000,0,1),100,10),3)
Error in x[-x[i, ], ] : only 0's may be mixed with negative subscripts
You don't really need to center your values, so you can do:
dat <- center_scale(iris[1:4], mean_center = FALSE, sd_scale = TRUE)
res2 <- kkz(x=dat, k=3)
I would be quite cautious about using this package until you figure out why this happens.

Optimal string comparison method in Swift

What is the best algorithm to use to get a percentage similarity between two strings? I have been using Levenshtein so far, but it's not sufficient. Levenshtein gives me the number of differences, and then I have to turn that into a similarity by doing:
100 - (no. differences / no. characters in second string * 100)
For example, if I test how similar "ab" is to "abc", I get around 66% similarity, which makes sense, as "ab" is 2/3 similar to "abc".
The problem I encounter, is when I test "abcabc" to "abc", I get a similarity of 100%, as "abc" is entirely present in "abcabc". However, I want the answer to be 50%, because 50% of "abcabc" is the same as "abc"...
I hope this makes some sense... The second string is constant, and I want to test the similarity of different strings against it. By similar, I mean that "cat dog" and "dog cat" have an extremely high similarity despite the difference in word order.
Any ideas?
This StringMetric library implements both the Damerau–Levenshtein and the Levenshtein distance algorithms; you can check it, it has what you need:
https://github.com/autozimu/StringMetric.swift
Using the Levenshtein algorithm with input:
case1 - distance(abcabc, abc)
case2 - distance(cat dog, dog cat)
Output is:
distance(abcabc, abc) = 3 // which is fine, if you compute the percentage relative to abcabc
distance(cat dog, dog cat) = 6 // should be 0
So in the case of abcabc and abc we get 3, which is 50% of the longer word abcabc: exactly what you want to achieve.
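Expressed as a formula relative to the longer of the two strings (a small hypothetical helper in Python; distance stands for whatever edit-distance function you use):
def similarity_percent(a, b, distance):
    # 100% for identical strings; 50% for "abcabc" vs "abc" (distance 3, longer length 6)
    return 100.0 * (1 - distance(a, b) / max(len(a), len(b)))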
For the second case with cats and dogs, my suggestion is to split the strings into words, compare all possible combinations of them, and choose the smallest result.
UPDATE:
I will describe the second case with pseudo code, because I'm not very familiar with Swift.
get(cat dog) and split to array of words ('cat' , 'dog') //array1
get(dog cat) and split to array of words ('dog' , 'cat') //array2
var minValue = 0;
for every i-th element of `array1`
    var temp = maxIntegerValue // smallest 'distance(i, j)' found so far
    index = 0                  // remember index of the smallest temp
    for every j-th element of `array2`
        if (distance(i, j) < temp)
            temp = distance(i, j)
            index = j
    // here we have found the smallest distance(i, j) value of i in 'array2'
    // now we should delete that element from 'array2'
    delete element at index from array2
    // add temp to minValue
    minValue = minValue + temp
The workflow will be like this:
On the first iteration of the outer loop (for the value 'cat' of array1) we get 0, because element 0 of array1 and element 1 of array2 are both 'cat' and therefore identical. That element is then removed from array2, which afterwards only contains 'dog'.
On the second iteration of the outer loop (for the value 'dog' of array1) we also get 0, because it is identical to the 'dog' remaining in array2.
At least you now have an idea of how to deal with your problem. How exactly you implement it is up to you; you may well pick a different data structure.
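Since the question asks about Swift but the answer is only pseudo code, here is a minimal runnable sketch of the same greedy word-matching idea in Python (illustrative only; the levenshtein helper is a plain dynamic-programming edit distance):
def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def greedy_word_distance(s1, s2):
    words1, words2 = s1.split(), s2.split()
    total = 0
    for w1 in words1:
        if not words2:
            break
        best = min(words2, key=lambda w2: levenshtein(w1, w2))  # closest remaining word
        total += levenshtein(w1, best)
        words2.remove(best)  # each word of the second string is matched only once
    return total

print(greedy_word_distance("cat dog", "dog cat"))  # 0
print(greedy_word_distance("abcabc", "abc"))       # 3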

pure data [hist] implementation

No idea how to use [hist] in Pure Data.
And the three arguments of [hist] are:
the value of first class,
the value of last class,
the number of classes.
I cannot figure out what the first and second arguments mean. And how do I pass the output of [hist] to [tabwrite] and generate an array diagram in Pure Data?
It seems you are using the [hist] object from smlib.
The histogram will contain <number of classes> bins of equal size, with the first bin being equivalent to the <value of first class> and the last bin being equivalent to <value of last class>-1 (the offset is arguably a bug).
So, the value of first class is the minimum expected input value (x>=min), and the value of last class is the maximum expected input value (x<max).
Any input value exceeding those boundaries will be clipped.
Examples:
[3, absolute(
|
[hist 2 5 3]
|
[print]
This will create a 3-bin histogram, with the bins 2±0.5 (with clipping this means x<2.5), 3±0.5 and 4±0.5 (with clipping that is 3.5<x).
The input 3 will be filed into the second bin, so the absolute histogram is 0 1 0.
Similarly:
[3, absolute(
|
[hist 3 6 3]
|
[print]
This will create a 3-bin histogram, with the bins 3±0.5, 4±0.5 and 5±0.5.
The input 3 will be now filed into the first bin, so the absolute histogram is 1 0 0.
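To make the binning concrete, here is a rough Python model of the behaviour described above (only an illustration of the description, not the smlib source):
def hist_bin(x, first, last, n):
    width = (last - first) / n            # 1.0 in both examples above
    x = min(max(x, first), last - width)  # clip out-of-range values
    # bins are centred on first, first+width, ..., last-width
    return round((x - first) / width)

# hist_bin(3, 2, 5, 3) -> 1  (second bin, as in the first example)
# hist_bin(3, 3, 6, 3) -> 0  (first bin, as in the second example)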
Displaying the histogram:
You can set the table values by sending a list of numbers to the table, prefixed with the starting index:
[relative(
|
[hist 0 100 100]
|
[list prepend 0]
|
[s $0-histo]
[table $0-histo 100]
Alternatively, check the [array] object (which can also be accessed via [tabread] and the like).

How to use caffe to classify text?

I'm using the Rotten Tomatoes dataset to train my net. It's divided into two groups, positive and negative examples. How can I configure my CNN in Caffe to predict whether a given text is a positive or a negative example?
I already formatted the data; each sentence has a size of 56 words. But using the following config does not give me even a satisfactory result.
n = caffe.NetSpec()
n.data, n.label = L.Data(batch_size=batch_size, backend=P.Data.LMDB,
                         source=db_path,
                         transform_param=dict(scale=1 / mean),
                         ntop=2)
n.conv1 = L.Convolution(n.data, kernel_size=3, pad=1,
                        param=dict(lr_mult=1), num_output=10,
                        weight_filler=dict(type='xavier'))
n.pool1 = L.Pooling(n.conv1, kernel_size=n_classes,
                    stride=2, pool=P.Pooling.MAX)
n.ip1 = L.InnerProduct(n.pool1, num_output=100,
                       weight_filler=dict(type='xavier'))
n.relu1 = L.ReLU(n.ip1, in_place=True)
n.ip2 = L.InnerProduct(n.relu1, num_output=n_classes,
                       weight_filler=dict(type='xavier'))
n.loss = L.SoftmaxWithLoss(n.ip2, n.label)
My dataset is divided into two text files, one containing the positive examples and the other containing the negative examples (Polarity dataset v1.1). To organize my data I take the length of the longest sentence (59 words), and if a sentence is shorter than 59 words I pad it with extra tokens. I adapted this from this code. For example, let's pretend that the longest sentence has 3 words:
data = 'abc def ghijkl. mnopqrst uvwxyz. abcd.'
##
#In this data I have 3 sentences:
##
sentence_one = 'abc def ghijkl'
sentence_two = 'mnopqrst uvwxyz'
sentence_three = 'abcd'
sentence_one is the longest (3 words), so to format the other two sentences I did the following:
sentence_two = 'mnopqrst uvwxyz <PAD>'
sentence_three = 'abcd <PAD> <PAD>'
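In code, the padding step looks roughly like this (a hypothetical helper, shown only for illustration):
def pad_sentence(sentence, max_len, pad_token="<PAD>"):
    words = sentence.split()
    return " ".join(words + [pad_token] * (max_len - len(words)))

# pad_sentence('abcd', 3) -> 'abcd <PAD> <PAD>'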
I saved each positive and negative sentence to a Caffe Datum and stored them in LMDB:
datum = caffe.proto.caffe_pb2.Datum()
datum.channels = 1
datum.height = 59 #biggest sentence
datum.width = 1
datum.label = label # 0 or 1
datum.data = sentence.tobytes()
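Here sentence is presumably a numeric array of word indices (tobytes() is a NumPy method), and each Datum is then written to LMDB roughly like this (path, map size and key below are placeholders, not my exact script):
import lmdb

env = lmdb.open("sentences_lmdb", map_size=10 * 1024 ** 2)  # hypothetical path and size
with env.begin(write=True) as txn:
    # keys must be unique per record; a zero-padded running index is common
    txn.put(b"00000000", datum.SerializeToString())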
Using my Datum database and the above Caffe configuration I get poor accuracy (less than 3 percent). What am I doing wrong?