How can I get the numbers for the correlation matrix from Pandas Profiling?

I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?

It was a bit hard to track down, but starting from the documentation, specifically the report structure, and digging into the function get_correlation_items(summary), then looking at its usage in the source, we get to a call that essentially loops over each of the correlation types in the summary. To obtain that summary object, note that the caller is get_report_structure(summary), and its summary argument is simply the report's description_set property.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
The output is:
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
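Each entry in the correlations dict is a regular pandas DataFrame indexed by column name, so (as a small sketch, with a hypothetical output path) you can grab or export the full matrix directly:
# The full Pearson matrix is an ordinary DataFrame.
pearson_matrix = correlations["pearson"]
print(pearson_matrix)

# Hypothetical file name; write the matrix out like any other DataFrame.
pearson_matrix.to_csv("pearson_correlations.csv")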
Sample Notebook

Related

Can I draw a bipartite graph from every dataset?

I am trying to draw a bipartite graph for my data set, which is like below:
source target weight
reduce energy 25
reduce consumption 25
energy pennsylvania 4
energy natural 4
consumption balancing 4
The code that I am using to plot the graph is below:
import networkx as nx

C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target', 'weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
But when I check whether it is bipartite with the code below, I get False.
nx.is_bipartite(C_2021)
Could you please advise what the issue is?
The previous issue is resolved, but when I plot the bipartite graph with the steps below, I do not get a proper result. I would appreciate any help:
top_nodes_2021 = set(n for n,d in C_2021.nodes(data=True) if d['bipartite']==0)
top_nodes_2021
the output of the above is:
{'reduce'}
bottom_nodes_2021 = set(C_2021) - top_nodes_2021
bottom_nodes_2021
the output of the above is:
{'balancing', 'consumption', 'energy', 'natural', 'pennsylvania '}
Then I plot it with:
pos = nx.bipartite_layout(C_2021,top_nodes_2021)
plt.figure(figsize=[8,6])
# Pass that layout to nx.draw
nx.draw(C_2021, pos, node_color='#A0CBE2', edge_color='black', width=0.2,
        edge_cmap=plt.cm.Blues, with_labels=True)
and the result is:
It works for me using your code; nx.is_bipartite(C_2021) returns True. Check the example below:
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
import pandas as pd
import networkx as nx

data = StringIO('''source;target;weight
reduce;energy;25
reduce;consumption;25
energy;pennsylvania ;4
energy;natural;4
consumption;balancing;4
''')
df_final_2014 = pd.read_csv(data, sep=";")

C_2021 = nx.Graph()
C_2021.add_nodes_from(df_final_2014['source'], bipartite=0)
C_2021.add_nodes_from(df_final_2014['target'], bipartite=1)
edges = df_final_2014[['source', 'target', 'weight']].apply(tuple, axis=1)
C_2021.add_weighted_edges_from(edges)
nx.is_bipartite(C_2021)
Finally, to draw the graph, derive the bipartite sets from the structure itself. The bipartite attribute data you passed during creation (i.e. bipartite=0 and bipartite=1) is not reliable here, because nodes that appear in both the source and target columns get their attribute overwritten by the second add_nodes_from call.
Use the following commands:
from networkx.algorithms import bipartite
import matplotlib.pyplot as plt

top_nodes_2021, bottom_nodes_2021 = bipartite.sets(C_2021)
pos = nx.bipartite_layout(C_2021, top_nodes_2021)
plt.figure(figsize=[8, 6])
# Pass that layout to nx.draw
nx.draw(C_2021, pos, node_color='#A0CBE2', edge_color='black', width=0.2,
        edge_cmap=plt.cm.Blues, with_labels=True)
With the following result:
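As a quick diagnostic aside (a sketch assuming the df_final_2014 and C_2021 built above), you can see why the attribute-based split produced only {'reduce'}: nodes that occur in both columns are added twice, so their bipartite attribute ends up as 1.
# 'energy' and 'consumption' appear as both source and target, so the second
# add_nodes_from call overwrites their bipartite attribute with 1.
shared = set(df_final_2014['source']) & set(df_final_2014['target'])
print(shared)                                       # {'energy', 'consumption'}
print(nx.get_node_attributes(C_2021, 'bipartite'))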

PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by the distance between user vectors in PySpark. I'm running this in Databricks on a Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3).
Following the code in the docs, I am using the approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest who apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set too low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and am still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure whether changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col

dataA = [(0, Vectors.dense([0.7016968702094931, 0.2636417660310031, 4.155293362824633, 4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294, 0.2636417660310031, 4.1539923630906745, 4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfA)

# returns
# threshold of 100000 is clearly overkill
# A dataframe with the dfA and dfB feature vectors and an EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()

dataC = [(0, Vectors.dense([1.1600056435954367, 78.27652460873155, 3.5535837780801396, 0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482, 39.85571715054726, 1.0679201943112886, 0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])

brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(dfC)

# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results for the second half of the example above by increasing the bucketLength parameter value to 15. The threshold could then have been lowered as well, because the Euclidean distance turned out to be around 34.
Per the PySpark docs:
bucketLength: the length of each hash bucket; a larger bucket lowers the false negative rate.
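As a minimal sketch of that fix (assuming the dfC and dfD DataFrames from the question and an active SparkSession), rebuilding the model with the larger bucketLength looks like this:
from pyspark.ml.feature import BucketedRandomProjectionLSH

# Same model as before, but with bucketLength raised to 15 as described above.
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes",
                                  bucketLength=15.0, numHashTables=3)
model = brp.fit(dfC)

# The answer reports a Euclidean distance of roughly 34 for this pair, so a far
# lower threshold than 100000 (e.g. 50) should already be enough to return it.
model.approxSimilarityJoin(dfC, dfD, 50.0, distCol="EuclideanDistance").show()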

scipy.stats skewness does not provide correct skewness results

I noticed that the skewness returned from scipy.stats is not correct. pandas.skew() actually provides better results.
I am trying to replicate a classic paper, Expected Stock Returns and Volatility by French & Schwert, using S&P 500 data from 1928 to 1984. I follow the formula in the paper for the standard deviation of the return, and I am able to reproduce the mean and the standard deviation of the standard deviation.
However, when I use the scipy.stats.skew function, I cannot get a number for the skewness of the standard deviation of the S&P return (the SP_varianceMon_sqrRoot column). The function returns nan, where it should clearly return a value.
I switched to pandas.skew() and it returned the correct value, as in the paper.
Clearly, something is wrong with the scipy.stats.skew() function.
Results by scipy.stats.skew():
['Adj Close_gspc', 'Adj Close_gspc_lag', 'SP_Return', 'SP_Return_square',
'SP_Return_lag', 'SP_varianceMon', 'SP_varianceMon_sqrRoot']
array([ 0.6922229 , 0.69186265, -0.11292165, 4.23571807, -1.9556035 ,
5.39873607, nan])
Results by pandas.skew():
Adj Close_gspc 0.693745
Adj Close_gspc_lag 0.693384
SP_Return -0.113170
SP_Return_square 4.245033
SP_Return_lag -1.959904
SP_varianceMon 5.410609
SP_varianceMon_sqrRoot 2.800919
dtype: float64
You haven't provided enough information or sample code to reproduce the nan that you get.
To make scipy.stats.skew compute the same value as the skew() method in Pandas, add the argument bias=False.
Here's an example.
First, the imports:
In [21]: import numpy as np
In [22]: import pandas as pd
In [23]: from scipy.stats import skew
Generate some data:
In [24]: np.random.seed(8675309)
In [25]: x = np.random.weibull(0.2, size=15)
Compute the skew with scipy and with Pandas:
In [26]: skew(x, bias=False)
Out[26]: 3.7582525674514544
In [27]: pd.Series(x).skew()
Out[27]: 3.7582525674514544
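As for the nan itself: a common cause (an assumption here, since the original data isn't shown) is a missing value in the column. pandas' skew() skips NaN by default, while scipy.stats.skew propagates it unless you pass nan_policy='omit'. A minimal sketch:
# Hypothetical data with one missing value to illustrate the difference.
y = np.append(x, np.nan)

skew(y)                                   # nan: NaN propagates by default
skew(y, bias=False, nan_policy='omit')    # ignores the NaN, matching pandas
pd.Series(y).skew()                       # pandas skips NaN (skipna=True)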

Please help debug my call to the scipy library for a Kolmogorov-Smirnov test

I am completing an assignment but cannot get the right results from a Kolmogorov-Smirnov test for a small sample of observations against a 'norm' distribution.
I have set up a minimal sample in a Jupyter notebook with the expected kstest results, tried running it in several environments, and reviewed the call for hours. The answer key says my ks_value and p_value are wildly wrong.
But I cannot see my error.
The sample I have is from the test run in the answer key; it is a 1d array, a valid input option.
The sample mean and standard deviation I compute look right.
If I change ddof it makes a small difference (the hint is to use ddof=0).
'norm' is a valid distribution for kstest.
The library documentation is at
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest
Any ideas or comments?
Would you expect a sample = [0.37, 0.27, 0.69, 0.56, 0.26] compared to a normal distribution to have
a 'KS test statistic' of 0.64 or 0.24
and
a 'p-value' of 0.02 or 0.94?
TIA
import pandas as pd
import numpy as np
from scipy.stats import kstest
sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))
print('mean', normal_args[0])
print('std', normal_args[1])
ks_value, p_value = kstest(sample, 'norm', normal_args)
print('ks_value', ks_value)
print('p_value', p_value)
print('')
print('#####posted solution')
print('expected ks_value = 0.63919407')
print('expected p_value = 0.01650327')
mean 0.43000000000000005
std 0.1688786546606764
ks_value 0.23881183701141995
p_value 0.9379686201081335
####posted solution
expected ks_value = 0.63919407
expected p_value = 0.01650327
My bad, a rookie mistake.
The function defines the 3rd parameter as args=(). I had passed the 3rd argument in, treating the input as positional. Changing the call to
ks_value, p_value = kstest(sample, 'norm', args=(normal_args))
yields the correct response.

Low alpha for NLTK agreement using MASI distance

I'm getting a very low value for Krippendorff's alpha when I calculate agreement in NLTK using MASI as the distance function.
Three coders (Inky, Blinky, and Sue) are instructed to assign topic labels (love, gifts, slime, or gaming) to two texts (text01 and text02), based on what the texts are about. Each text can be about more than one topic, so coders may assign each text more than one label. The data and the code used to make the calculations are shown below:
import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

# (coder, item, label)
data = [('inky', 'text01', frozenset(['love', 'gifts'])),
        ('blinky', 'text01', frozenset(['love', 'gifts'])),
        ('sue', 'text01', frozenset(['love', 'gifts'])),
        ('inky', 'text02', frozenset(['slime', 'gaming'])),
        ('blinky', 'text02', frozenset(['slime'])),
        ('sue', 'text02', frozenset(['slime', 'gaming']))]

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(data)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()
When I run the code, I get the following results:
Statistics for dataset using <function jaccard_distance at 0x09D26DB0>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.7272727272727273
Kappa: 0.7777777777777777
Multi-Kappa: 0.7499999999999999
Alpha: 0.75
Statistics for dataset using <function masi_distance at 0x09D26DF8>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.8172727272727272
Kappa: 0.8511111111111113
Multi-Kappa: 0.8324999999999998
Alpha: -1.5
My question is, why is the alpha so low when using the MASI distance function compared to Jaccard?
I was unable to reproduce the error and got the correct value of Krippendorff's alpha with MASI distance when running the provided code. I used Python 3.5.2, NumPy 1.18.2, and NLTK 3.4.5. Thus, the most probable answer is that you need to update NLTK.
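As a quick sketch of that check (the version numbers above are just the ones this answer was tested with), confirm what you are running and upgrade if needed:
# Print the installed versions to compare against the ones used above.
import nltk
import numpy
print(nltk.__version__, numpy.__version__)

# Upgrade from the command line if your NLTK is older, e.g.:
#   pip install --upgrade nltk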