How can I plot a histogram in PySpark?

I am new to PySpark. I have a table as below and I want to plot a histogram of this df: the x axis should show the "word" column and the y axis should show the "count" column. Do you have any idea?
word count
Akdeniz’in 14
en 13287
büyük 3168
deniz 1276
festivali: 6

First of all, a histogram is not the correct diagram type to visualize a word count. Histograms are useful to visualize the distribution of a variable; bar charts, in contrast, are used to compare variables (read this article for more information). With the following code you can create a bar chart for your example:
from matplotlib import pyplot
l = [( 'Akdeniz’in', 14)
,('en' , 13287)
,('büyük' , 3168)
,('deniz' , 1276)
,('festivali:' , 6)]
df = spark.createDataFrame(l,['word','count'])
# Collect the rows to the driver (not recommended when you have a huge dataframe)
rows = df.collect()
# Create a numeric position for every label
indexes = list(range(len(rows)))
# Split words and counts into separate lists
values = [r['count'] for r in rows]
labels = [r['word'] for r in rows]
# Plot the bars; pyplot.bar centers each bar on its index by default
pyplot.bar(indexes, values)
# Put each label at the position of its bar
pyplot.xticks(indexes, labels)
pyplot.show()
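To make the difference between the two chart types concrete, here is a minimal sketch in plain Python using the counts from the question. The power-of-ten bins are a hypothetical choice for illustration only. A bar chart needs one height per word, while a histogram tallies how many words fall into each range of the numeric variable.

```python
import math
from collections import Counter

# Word counts copied from the question.
word_counts = {
    "Akdeniz’in": 14,
    "en": 13287,
    "büyük": 3168,
    "deniz": 1276,
    "festivali:": 6,
}

# Bar chart data: one bar per category (word), height = its count.
bar_labels = list(word_counts)
bar_heights = list(word_counts.values())

# Histogram data: bin the numeric variable (count) and tally how many
# words land in each bin. Bin k covers counts from 10**k up to 10**(k+1).
histogram = Counter(int(math.log10(c)) for c in word_counts.values())

for exponent in sorted(histogram):
    print(f"10^{exponent} to 10^{exponent + 1}: {histogram[exponent]} word(s)")
```

The histogram answers "how are the counts distributed?", whereas the bar chart answers "which word is more frequent?", which is what the question asks for.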

Related

How to visualize rownames of specific data points within a cluster plot in R?

I have calculated clusters with a big dataset (1) and found four clusters which I plotted. Now I have 30 new data points (2) that I want to plot in/ on top of the existing clusters in order to see which of the new data points is closest to the original cluster centroids (of the 1. big dataset).
What I did so far:
#I have combined both data sets (1. my old big data set) and (2. my 30 new data points) and added an indicator variable in order to distinguish between the old and new data sets:
# I only chose variables that are needed for the cluster calculations as well as the indicator
combined.ind <- combined [, c(1752:1757, 1759:1762, 1942)]
#I created a factor variable that indicates "new" and "old" data points
combined.ind$indicator <- factor(combined.ind$indicator,
levels = c(0,1),
labels = c("new", "old"))
#Then I calculated a hierarchical cluster analysis with the ward-centroids which I have then used for calculating a k-means clustering:
#calculate ward-centroids:
combined.ward.cent <- aggregate(cbind(Z1, Z2, Z3, Z4, Z5, Z6, Z7, Z8, Z9, Z10)~CLU4_1,combined,mean)
combined.ward.cent2 <- combined.ward.cent[, c(2:11)]
#apply kmeans with ward centroids as initial starting points:
kmeans <- kmeans(combined.ind[1:(length(combined.ind)-1)], centers = combined.ward.cent2)
#Then I have plotted the results and tried to highlight the new data points:
#Plot the results
fviz_cluster(kmeans, data = combined.ind[, 1:length(combined.ind)-1])
#I changed the colors with scale color manual in order to see the new data points.
fviz_cluster(kmeans, data = combined.ind[, 1:length(combined.ind)-1], geom=c("point", "text"), ellipse = T) + geom_point(aes(color=combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))
Since the first dataset is huge, I cannot read the rownames of the new data points because all of them overlap. When I add repel = TRUE to the call (see below), only the rownames of the data points on the edge are shown, which does not help me because I only want to show the rownames of the new data points.
fviz_cluster(kmeans, data = combined.ind[, 1:length(combined.ind)-1], geom=c("point", "text"), repel = TRUE, ellipse = T) +
geom_point(aes(color=combined.ind$indicator)) + ggtitle("My Beautiful Graph") +
scale_color_manual("Old vs New", values = c("new" = "black", "old" = "red"))
How can I solve this problem?

Draw Networkx Directed Graph using clustering labels as color scheme

I need help drawing a networkx directed graph. I have a directed graph which I create from a dataframe that looks as the following:
source target weight
ip_1 ip_2 3
ip_1 ip_3 6
ip_4 ip_3 7
.
.
.
Afterwards, I have clustered this graph using elbow+kmeans, after converting the nodes into embeddings using Node2Vec:
https://github.com/eliorc/node2vec
At the end, I have this resulting dataframe:
source target weight source_kmeans_label target_kmeans_label elbow_optimal_k
ip_1 ip_2 3 0 1 12
ip_1 ip_3 6 2 0 12
ip_4 ip_3 7 0 3 12
.
.
.
I want to visualize (draw) this graph (source, target, weight) using different colors based on the elbow value; so for the example above, I will have 12 different colors. I really appreciate any help to achieve this, thanks.
You can use a seaborn palette to generate 12 different RGB color values and then create a column called color in your dataframe based on the weight values:
import seaborn as sns
import networkx as nx
from pyvis.network import Network
palette = sns.color_palette("husl", n_colors=12).as_hex() # n_colors is your elbow value; as_hex() gives color strings that pyvis can render
Assuming your dataframe is called df, you can add the new column color based on the weight column as follows:
df['color'] = df.apply(lambda row: palette[row['weight'] - 1], axis=1)
Now that you have an RGB value for each edge, first you need to make your graph from the dataframe and then you can visualize the graph using pyvis:
G = nx.from_pandas_edgelist(df, 'source', 'target', edge_attr='color', create_using=nx.DiGraph())
N = Network(height='100%', width='100%', bgcolor='white', font_color='black', directed=True)
for n in G.nodes:
    N.add_node(n)
# G.edges.data() yields (source, target, attribute dict) triples
for u, v, attrs in G.edges.data():
    N.add_edge(u, v, color=attrs['color'])
N.write_html('path/to/your_graph.html')
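Since the question asks for colors driven by the cluster labels, a node-level mapping may also be useful. The following is only a sketch in plain Python: the `rows` list, `make_palette` and `node_color` are hypothetical names, and the palette is built with the standard-library `colorsys` module rather than seaborn. The sample rows in the question list different labels for the same node (ip_1 appears with 0 and 2), so this sketch assumes the first label seen for a node is the one to keep.

```python
import colorsys

# Rows mirror the question's dataframe:
# (source, target, weight, source_kmeans_label, target_kmeans_label)
rows = [
    ("ip_1", "ip_2", 3, 0, 1),
    ("ip_1", "ip_3", 6, 2, 0),
    ("ip_4", "ip_3", 7, 0, 3),
]
optimal_k = 12  # elbow_optimal_k from the question


def make_palette(k):
    """Return k evenly spaced hues as hex color strings."""
    palette = []
    for i in range(k):
        r, g, b = colorsys.hsv_to_rgb(i / k, 0.65, 0.9)
        palette.append(
            "#{:02x}{:02x}{:02x}".format(int(r * 255), int(g * 255), int(b * 255))
        )
    return palette


palette = make_palette(optimal_k)

# Collect one cluster label per node (first occurrence wins, see assumption above).
node_label = {}
for source, target, _weight, source_label, target_label in rows:
    node_label.setdefault(source, source_label)
    node_label.setdefault(target, target_label)

# Map every node to the color of its cluster.
node_color = {node: palette[label] for node, label in node_label.items()}
print(node_color)
```

With such a mapping, the node loop could pass the color along, for example N.add_node(n, color=node_color[n]), so that nodes in the same cluster share a color.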

Applying scipy.stats.gaussian_kde to 3D point cloud

I have a set of about 33K (x,y,z) points in a csv file and would like to convert this to a grid of density values using scipy.stats.gaussian_kde. I have not been able to find a way to convert this point cloud array into an appropriate input format for the gaussian_kde function (and then take the output of this and convert it into a density value grid). Can anyone provide sample code?
Here's an example with some comments which may be of use. gaussian_kde wants the data and the evaluation points to be row stacked, i.e. shaped (number of dimensions, number of values), as per the docs. In your case you would use np.row_stack([x, y, z]) (an alias of np.vstack) so that the shape is (3, 33000).
from scipy.stats import gaussian_kde
import numpy as np
import matplotlib.pyplot as plt
# simulate some data
n = 33000
x = np.random.randn(n)
y = np.random.randn(n) * 2
# data must be stacked as (# ndim, # n values) as per docs.
data = np.row_stack((x, y))
# perform KDE
kernel = gaussian_kde(data)
# create grid over which to evaluate KDE
s = np.linspace(-8, 8, 128)
grid = np.meshgrid(s, s)
# again KDE needs points to be row_stacked
grid_points = np.row_stack([g.ravel() for g in grid])
# evaluate KDE and reshape result correctly
Z = kernel(grid_points)
Z = Z.reshape(grid[0].shape)
# plot KDE as image and overlay some data points
fig, ax = plt.subplots()
ax.matshow(Z, extent=(s.min(), s.max(), s.min(), s.max()))
ax.plot(x[::10], y[::10], 'w.', ms=1, alpha=0.3)
ax.set_xlim(s.min(), s.max())
ax.set_ylim(s.min(), s.max())
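The example above is two-dimensional. A three-dimensional variant could look like the sketch below. The random points stand in for the contents of the CSV file, and the sample size, grid range and grid resolution are arbitrary assumptions chosen to keep it fast.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Simulated (x, y, z) points; replace these with the columns read from the CSV.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = rng.normal(size=n) * 2
z = rng.normal(size=n) * 0.5

# Stack as (number of dimensions, number of points) -> shape (3, n).
data = np.vstack([x, y, z])
kernel = gaussian_kde(data)

# Regular 3D grid on which to evaluate the density.
s = np.linspace(-4, 4, 16)
gx, gy, gz = np.meshgrid(s, s, s, indexing="ij")

# The evaluation points must be stacked the same way: shape (3, 16**3).
grid_points = np.vstack([gx.ravel(), gy.ravel(), gz.ravel()])

# Evaluate and reshape back to the grid: density[i, j, k] belongs to
# the point (s[i], s[j], s[k]).
density = kernel(grid_points).reshape(gx.shape)
print(density.shape)
```

Evaluation cost grows with the number of data points times the number of grid points, so a full 33K point cloud on a fine grid may take noticeably longer than this sketch.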

Matlab boxplot adjacent values

I found that calculating an index to specify outliers of a dataset according to how the boxplot works does not give the same results. Please find below an example where I create some data, extract the values from the boxplot (as seen in datatips in the figure window) and compare them to the values I calculated.
While the median and quartiles match up, the upper and lower adjacent values do not. According to the Matlab help under 'Whisker', the adjacent values are calculated as q3 + w*(q3-q1), where q3 and q1 are the quartiles and w is the specified whisker length.
Am I calculating this wrong or is there any other mistake? I would like to be able to explain the error.
Screenshot of results table (please note the results vary due to random data)
%Boxplot test
% create random, normally distributed dataset
data = round(randn(1000,1)*10,2);
figure(10)
clf
boxplot(data,'Whisker',1.5)
clear stats tmp
% read data from boxplot, same values as can be seen in datatips in the figure window
h = findobj(gcf,'tag','Median');
tmp = get(h,'YData');
stats(1,1) = tmp(1);
h = findobj(gcf,'tag','Box');
tmp = get(h,'YData');
stats(1,2) = tmp(1);
stats(1,3) = tmp(2);
h = findobj(gcf,'tag','Upper Adjacent Value');
tmp = get(h,'YData');
stats(1,4) = tmp(1);
h = findobj(gcf,'tag','Lower Adjacent Value');
tmp = get(h,'YData');
stats(1,5) = tmp(1);
% calculated data
stats(2,1) = median(data);
stats(2,2) = quantile(data,0.25);
stats(2,3) = quantile(data,0.75);
range = stats(2,3) - stats(2,2);
stats(2,4) = stats(2,3) + 1.5*range;
stats(2,5) = stats(2,2) - 1.5*range;
% error calculation
for k=1:size(stats,2)
stats(3,k) = stats(2,k)-stats(1,k);
end %for k
% convert results to table with labels
T = array2table(stats,'VariableNames',{'Median','P25','P75','Upper','Lower'}, ...
'RowNames',{'Boxplot','Calculation','Error'});
The calculated boundaries, e.g. upper = q3 + w*(q3-q1), are correct, but they are not what the boxplot displays. What is displayed and tagged as the upper/lower adjacent value is the largest and smallest data value that lies within those boundaries.
Regarding the initial task leading to the question: For applying the same filtering of outliers as in the boxplot the calculated boundaries can be used.
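The difference between the boundaries and the adjacent values can be illustrated with a small sketch in Python. The data set is made up, and the quartiles come from the standard library's "inclusive" method, which is an assumption: MATLAB's quantile uses a different interpolation rule, so the quartiles themselves may differ slightly.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
w = 1.5  # whisker length, as in boxplot(data,'Whisker',1.5)

# Quartiles (assumed interpolation method, see note above).
q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Boundaries (fences): the values computed in the question.
upper_fence = q3 + w * iqr
lower_fence = q1 - w * iqr

# Adjacent values: the most extreme data points still inside the fences.
# These are what the boxplot whiskers show.
upper_adjacent = max(v for v in data if v <= upper_fence)
lower_adjacent = min(v for v in data if v >= lower_fence)

# Outliers: everything outside the fences.
outliers = [v for v in data if v < lower_fence or v > upper_fence]

print(lower_fence, upper_fence)        # -3.5 14.5
print(lower_adjacent, upper_adjacent)  # 1 9
print(outliers)                        # [30]
```

The fences (-3.5 and 14.5) are not data values, which is why they never match the whisker ends (1 and 9) read from the plot.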

Plot portfolio composition map in Julia (or Matlab)

I am optimizing a portfolio of N stocks over M levels of expected return. After doing this I get a series of weights (i.e. an M x N matrix where each row is a combination of stock weights for a particular level of expected return). Weights add up to 1.
Now I want to plot something called a portfolio composition map (right plot on the picture), which is a plot of these stock weights over all levels of expected return. Each stock has a distinct color, and its length (at every level of return) is proportional to its weight.
My question is how to do this in Julia (or MATLAB)?
I came across this and the accepted solution seemed so complex. Here's how I would do it:
using Plots
@userplot PortfolioComposition
@recipe function f(pc::PortfolioComposition)
    weights, returns = pc.args
    weights = cumsum(weights,dims=2)
    seriestype := :shape
    for c=1:size(weights,2)
        sx = vcat(weights[:,c], c==1 ? zeros(length(returns)) : reverse(weights[:,c-1]))
        sy = vcat(returns, reverse(returns))
        @series Shape(sx, sy)
    end
end
# fake data
tickers = ["IBM", "Google", "Apple", "Intel"]
N = 10
D = length(tickers)
weights = rand(N,D)
weights ./= sum(weights, dims=2)
returns = sort!((1:N) + D*randn(N))
# plot it
portfoliocomposition(weights, returns, labels = tickers)
matplotlib has a pretty powerful polygon plotting capability, e.g. this link on plotting filled polygons:
ploting filled polygons in python
You can use this from Julia via the excellent PyPlot.jl package.
Note that the syntax for certain things changes; see the PyPlot.jl README and e.g. this set of examples.
You "just" need to calculate the coordinates from your matrix and build up a set of polygons to plot the portfolio composition graph. It would be nice to see the code if you get this working!
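The coordinate calculation described here can be sketched in plain Python. The tiny weights matrix and the function name `composition_polygons` are hypothetical. Each row of weights is one return level and sums to 1.

```python
from itertools import accumulate

# Hypothetical example: 3 return levels, 2 assets.
returns = [1.0, 2.0, 3.0]
weights = [
    [0.5, 0.5],
    [0.25, 0.75],
    [0.75, 0.25],
]


def composition_polygons(weights, returns):
    """Return one list of (x, y) vertices per asset.

    x is the cumulative weight and y is the return level, so each polygon
    spans from the previous asset's cumulative weight to its own.
    """
    # Left edge at 0, then the running totals of the weights in each row.
    cumulative = [[0.0] + list(accumulate(row)) for row in weights]
    polygons = []
    for asset in range(len(weights[0])):
        left = [(row[asset], r) for row, r in zip(cumulative, returns)]
        right = [(row[asset + 1], r) for row, r in zip(cumulative, returns)]
        # Walk up the left boundary, then back down the right boundary.
        polygons.append(left + right[::-1])
    return polygons


polygons = composition_polygons(weights, returns)
for vertices in polygons:
    print(vertices)
```

Each vertex list can then be handed to a polygon patch (in matplotlib, matplotlib.patches.Polygon added to the axes with add_patch) with one color per asset.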
So I was able to draw it, and here's my code:
using PyPlot
using PyCall
@pyimport matplotlib.patches as patch
N = 10
D = 4
weights = Array(Float64, N,D)
for i in 1:N
w = rand(D)
w = w/sum(w)
weights[i,:] = w
end
weights = [zeros(Float64, N) weights]
weights = cumsum(weights,2)
returns = sort!([linspace(1,N, N);] + D*randn(N))
##########
# Plot #
##########
polygons = Array(PyObject, 4)
colors = ["red","blue","green","cyan"]
labels = ["IBM", "Google", "Apple", "Intel"]
fig, ax = subplots()
fig[:set_size_inches](5, 7)
title("Problem 2.5 part 2")
xlabel("Weights")
ylabel("Return (%)")
ax[:set_autoscale_on](false)
ax[:axis]([0,1,minimum(returns),maximum(returns)])
for i in 1:(size(weights,2)-1)
xy=[weights[:,i] returns;
reverse(weights[:,(i+1)]) reverse(returns)]
polygons[i] = matplotlib[:patches][:Polygon](xy, true, color=colors[i], label = labels[i])
ax[:add_artist](polygons[i])
end
legend(polygons, labels, bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0)
show()
# savefig("CompositionMap.png",bbox_inches="tight")
I can't say that this is the best way to do this, but at least it is working.