Getting values in Seaborn boxplot - boxplot

I would like to get the specific values by a boxplot generated in Seaborn
(i.e., media, quartile). For example, in the boxplot below (source: link)
Is there a any way to get the media and quartiles instead of manually estimation?
import numpy as np
import seaborn as sns
sns.set(style="ticks", palette="muted", color_codes=True)
# Load the example planets dataset
planets = sns.load_dataset("planets")
# Plot the orbital period with horizontal boxes
ax = sns.boxplot(x="distance", y="method", data=planets,
whis=np.inf, color="c")

I would encourage you to become familiar with using pandas to extract quantitative information from a dataframe. For instance, a simple thing you could to do to get the values you are looking for (and other useful ones) would be:
planets.groupby("method").distance.describe().unstack()
which prints a table of useful values for each method.
Or if you just want the median:
planets.groupby("method").distance.median()

Sometimes I use my data as a list of arrays instead of pandas. So for that, you might need:
min(d), np.quantile(d, 0.25), np.median(d), np.quantile(d, 0.75), max(d)

Related

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X<-df1[['X1','X2','X3','X4','X5']]
Y<-df1['Target']
model = LinearRegression().fit(X,Y)
In order to do feature selections. You need to run the model first. Then check for the p-value. Typically, a p-value of 5% (.05) or less is a good cut-off point. If the p-value crosses the upper threshold of .05, the variable is insignificant and you can remove it from your model. You will have to do this manually. You can also tell by looking from the correlation matrix to see which value has less correlation to the target. AFAIK, there are no libs with built-in functionality to do feature selection automatically. In the end, statistics are just numbers. It is up to humans to interpret the results.

Calculating transpose of a tensor in Paraview

I am required to calculate the following in Paraview:
How can I calculate the transpose used in the above formula ? Basically I would like to know how to calculate the transpose of a matrix in Paraview.
As suggested by #Nico Vuaille, you should make use of Numpy support in ParaView. Simply apply a Programmable Filter to the dataset of interest, and supply a script comparable to the following.
import numpy as np
u = inputs[0].PointData['Velocity']
# Calculate gradient here, say uGrad
output.PointData.append(uGrad, 'Gradient')
EDIT: I have actually tried to generate your calculation with one of my datasets and realised that my answer and comments are not so helpful. Therefore, this is what I would suggest now, which should work:
Load your dataset in ParaView
Apply a Gradient / Gradient Of Unstructured Dataset filter on your dataset and select the velocity field as the input field (I used Gradient Of Unstructured Dataset, from which you have the possibility to also directly work out both divergence and vorticity fields).
Apply a Programmable Filter filter to the resulting dataset you obtained from the previous step and supply the code below.
Script
import numpy as np
grad = inputs[0].PointData['Gradients']
omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
output.PointData.append(omega, 'Omega')
You should end up with another item in your ParaView pipeline that only contains the expected Omega.
EDIT 2: The input file is using the XMDF format. When loaded into ParaView, it is interpreted as a Multi-Block Dataset of Blocks. As a result, the code snippet provided to the Script argument of Programmable Filter has to be updated to:
import paraview.vtk.numpy_interface.dataset_adapter as dsa
for i in range(inputs[0].GetNumberOfBlocks()):
data = dsa.WrapDataObject(inputs[0].GetBlock(i))
grad = data.PointData['Gradients']
omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
data.PointData.append(omega, 'Omega')
output.SetBlock(i, data.VTKObject)
I think this can be easily computed using Python calculator (no need for programmable filter):
To compute the gradient, type:
gradient(u)
To compute the symmetric part of the tensor gradient(u):
strain(u)
To compute the non-symmetric part, Omega, of the gradient tensor:
gradient(u) - strain(u)
Note that that the gradient(u) tensor can be written as follows:

How to handle flag/exception values

In Paraview, I am working with a dataset that uses the value -99999 as a flag value. I'd like to be able to manipulate the dataset without these values causing issues with things like glyphs and colorbars. Nominally, I'd like the data to be "ignored".
A little about the data: I've got both scalar and vector point data, sitting on a fixed 2D spatial mesh at set temporal intervals.
Although -99999 is very far beyond the values the data might otherwise show, using a threshold filter isn't an option because the flag can occur at different places at different times. The way Paraview's threshold filter works means that the point ID to a fixed point in space will change as the number of filtered points changes through time.
In case it matters, data are in a netCDF file that is read in via an XMF header file and the XDMF Reader since the CF reader doesn't work (possibly because of my unstructured triangular mesh). The netCDF data have the _FillValue global attribute, however this doesn't appear to be getting picked up on by Paraview.
You could use a Programmable Filter to replace values below -99999 by NaN. Providing the data is not a vtkMultiblockDataSet, you can use the following script in the programmable filter :
import numpy as np
from vtk.numpy_interface import dataset_adapter as dsa
# name of the array
name = 'name'
# limit
limit = -99999
array = inputs[0].PointData[name].copy()
array[array<=limit] = np.nan
out = dsa.WrapDataObject(self.GetOutput())
out.PointData.append(array, name)
Note: if data of interest is a Cell Data, replace PointData by CellData in the script.
Note 2: the script was tested on ParaView 5.6.

Generating synthetic data of mixed type (numerical/categorical) in Matlab (or any other)

I'm trying to generate some synthetic data for experiments. When it comes to data sets with numerical features this is rather easy, I just use a Gaussian mixture (using Netlab, a package for Matlab) and that's done.
Noooww, I also need to generate some data sets with numerical and categorical features. The numerical part I can easily do using the above method, what about the categorical?
I was thinking to generate a categorical feature with (say) 3 categories with probabilities of 68.2% (+/- 1 sigma), 27.2% (between +/- 1 sigma and +/- 2 sigma), and 4.6% (the rest) within the objects with the same label.
And perhaps another categorical feature with 5 categories, with probabilities of 34.1%, 34.1%, 13.6%, 13.6%, 4.6% - again, within the objects with the same label.
Does that make sense to you guys? any thoughts?
I can easily write the code for the above, but if you know of any function that does it for me - please let me know.
Thanks!
It's easy to do in Python using numpy:
import numpy as np
np.random.multinomial(n=1, pvals=[.3,.3,.4], size=10)

Any good visualization program for my situation?

Suppose you have a hash table where keys are strings and values are integers. Do you have any idea to visualize the hash table as a bar graph or histogram such that x-axis represents the key strings and y-axis represents the range of values?
Thanks in advance!
I don't know exactly what you are looking for but this is something you could do. You can use python with matplotlib. The following piece of code might help you to achieve what you want.
import matplotlib.pyplot as plt
hashtable = {'key1':10 , 'key2':20 , 'key3':14}
plt.bar(range(len(hashtable)), hashtable.values(), align='center')
plt.xticks(range(len(hashtable)), hashtable.keys())
plt.show()
For that you need to install python and matplotlib.