What is 'ps' in pyspark? I need help plotting histograms and boxplots - pyspark

On the Apache Spark website, it says that to plot a boxplot, I need to do
df = ps.DataFrame(data, columns=list('ABCD'))
Similarly, for a histogram, it says I need to do
df = ps.from_pandas(df)
df.plot.hist(bins=12, alpha=0.5)
But when I type in ps, it returns an error. So what is ps?

ps is the conventional alias for pyspark.pandas, the pandas API on Spark that ships with PySpark. It gives you a pandas-like DataFrame interface on top of Spark and lets you convert between pandas and pandas-on-Spark DataFrames.
You can import it like this:
import pyspark.pandas as ps
Here is a small example of converting a pandas-on-Spark DataFrame to pandas:
>>> import pyspark.pandas as ps
>>>
>>> psdf = ps.range(10)
>>> pdf = psdf.to_pandas()
>>> pdf.values
array([[0],
       [1],
       [2],
       [3],
       [4],
       [5],
       [6],
       [7],
       [8],
       [9]])
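With pyspark.pandas imported, the plotting calls from the question also work on pandas-on-Spark DataFrames. A minimal sketch, where the random data array is just a stand-in for your own values:
```
import numpy as np
import pandas as pd
import pyspark.pandas as ps

# stand-in data: 50 rows, 4 columns named A-D
data = np.random.randn(50, 4)

# boxplot, as in the first snippet from the question
psdf = ps.DataFrame(data, columns=list('ABCD'))
psdf.plot.box()

# histogram: build a pandas DataFrame, convert it, then plot
pdf = pd.DataFrame(data, columns=list('ABCD'))
psdf2 = ps.from_pandas(pdf)
psdf2.plot.hist(bins=12, alpha=0.5)
```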
For more details, see the pandas API on Spark user guide:
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/pandas_pyspark.html

Related

Why is sklearn giving me a ValueError in train_test_split

ValueError: Expected 2D array, got 1D array instead:
array=[712. 3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler
import seaborn as sb
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
df=sb.load_dataset('titanic')
df2=df[['survived','pclass','age','parch']]
df3=df2.fillna(df2.mean())
x=df3.drop('survived',axis=1)
y=df3['survived']
x_train,y_train,x_test,y_test=train_test_split(x,y,test_size=0.2, random_state=51)
print('x_train',x_train.shape)
sc=StandardScaler()
sc.fit(x_train.shape)
x_train=x_train.reshape(-1,1)
x_train_sc=sc.transform(x_train)
x_test_sc=sc.transform(x_test)
print(x_train_sc)
```
I would really appreciate it if you could find me a solution. I have applied train_test_split to the x and y variables and also transformed the result into the x_train variable. I was trying to print x_train, but it showed me this error:
```
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[712. 3.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
```
You're supposed to give your StandardScaler your x_train, not the shape of your x_train :)
sc=StandardScaler()
sc.fit(x_train)
x_train_sc=sc.transform(x_train)
x_test_sc=sc.transform(x_test)
If you want to normalize your data to a [-1, 1] range, it's better to use MinMaxScaler:
from sklearn.preprocessing import MinMaxScaler
...
sc = MinMaxScaler(feature_range=(-1, 1)).fit(x_train)
x_train_sc = sc.transform(x_train)
x_test_sc = sc.transform(x_test)
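One more thing worth noting, separate from the scaler issue: train_test_split returns the splits in the order x_train, x_test, y_train, y_test, so the unpacking in the original snippet is also off. A minimal sketch of the corrected flow:
```
import seaborn as sb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = sb.load_dataset('titanic')
df2 = df[['survived', 'pclass', 'age', 'parch']]
df3 = df2.fillna(df2.mean())
x = df3.drop('survived', axis=1)
y = df3['survived']

# note the return order: x_train, x_test, y_train, y_test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=51)

sc = StandardScaler()
sc.fit(x_train)                 # fit on the data itself, not on its shape
x_train_sc = sc.transform(x_train)
x_test_sc = sc.transform(x_test)
print(x_train_sc)
```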

How can I reuse the dataframe and use an alternative to iloc to run an iterative imputer in Azure Databricks

I am running an iterative imputer in a Jupyter notebook to first mark the known incorrect values as NaN and then run the iterative imputer to fill in corrected values, to achieve the required sharpness in the data. The sample code is given below:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
import pandas as pd

idx = [761, 762, 763, 764]
cols = ['11', '12', '13', '14']

def fit_imputer():
    for i in range(len(idx)):
        for col in cols:
            dfClean.iloc[idx[i], col] = np.nan
            print('Index = {} Col = {} Defiled Value is: {}'.format(idx[i], col, dfClean.iloc[idx[i], col]))
            # Run Imputer for Individual row
            tempOut = imp.fit_transform(dfClean)
            print("Imputed Value = ", tempOut[idx[i], col])
            dfClean.iloc[idx[i], col] = tempOut[idx[i], col]
            print("new dfClean Value = ", dfClean.iloc[idx[i], col])
            origVal.append(dfClean_Orig.iloc[idx[i], col])
I get an error when I try to run this code on Azure Databricks using PySpark or Scala, because DataFrames in Spark are immutable and I cannot use iloc the way I did with the pandas DataFrame.
Is there a way, or a better way, of implementing this kind of imputation in Databricks?
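Purely as an illustration (this is not from the original thread), one common workaround when the cleaned data fits on the driver is to pull the Spark DataFrame into pandas, run the imputer there, and push the result back. Here dfClean_spark is a hypothetical Spark DataFrame and spark is the SparkSession that Databricks notebooks provide:
```
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# hypothetical Spark DataFrame holding the cleaned data
pdf = dfClean_spark.toPandas()

# mark the known-bad cells as NaN, as in the pandas version above
idx = [761, 762, 763, 764]
cols = ['11', '12', '13', '14']
for i in idx:
    for col in cols:
        pdf.loc[i, col] = np.nan

# impute all marked cells in one pass
imp = IterativeImputer(max_iter=10, random_state=0)
pdf[pdf.columns] = imp.fit_transform(pdf)

# bring the imputed result back into Spark
dfImputed = spark.createDataFrame(pdf)
```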

How to generate / split data randomly in PySpark

The following line of Scala code in Apache Spark will split data randomly across 8 partitions:
import org.apache.spark.sql.functions.rand
df
.repartition(8, col("person_country"), rand)
.write
.partitionBy("person_country")
.csv(outputPath)
Can someone show me how to do the equivalent with PySpark? I have attempted it myself with the following code, but it fails
from pyspark.sql.functions import rand
df\
.repartition(8, col("person_country"), rand)\
.write.partitionBy("person_country")\
.format('csv').mode('Overwrite')\
.save("outputPath")
Any thoughts?
Use repartition(8, col("person_country"), rand()), i.e. with parentheses after rand: rand is a function, so it has to be called to produce a random Column expression.
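Putting that together with the snippet from the question (note that col also needs to be imported), the working version should look roughly like this:
```
from pyspark.sql.functions import col, rand

df \
    .repartition(8, col("person_country"), rand()) \
    .write \
    .partitionBy("person_country") \
    .format('csv') \
    .mode('overwrite') \
    .save("outputPath")
```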

spark.read.format('libsvm') not working with python

I am learning PySpark and encountered a problem that I can't fix. I followed this video to copy code from the PySpark documentation for loading data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). (I had created a Spark DataFrame before this, by the way.) When I run this code in a Jupyter notebook, it keeps giving me a Java error, while the person in the video did exactly the same thing and didn't get that error. Can someone help me resolve this issue, please?
Many thanks!
I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)
You can use this custom function to read a libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector

def read_libsvm(filepath, spark_session):
    '''
    A utility function that takes in a libsvm file and turns it into a pyspark dataframe.
    Args:
        filepath (str): The file path to the data file.
        spark_session (object): The SparkSession object to create dataframe.
    Returns:
        A pyspark dataframe that contains the data loaded.
    '''
    with open(filepath, 'r') as f:
        raw_data = [x.split() for x in f.readlines()]

    outcome = [int(x[0]) for x in raw_data]

    index_value_dict = list()
    for row in raw_data:
        index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
                                      for x in row[1:]]))

    max_idx = max([max(x.keys()) for x in index_value_dict])

    rows = [
        Row(
            label=outcome[i],
            feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
        )
        for i in range(len(index_value_dict))
    ]

    df = spark_session.createDataFrame(rows)
    return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)
You can also try loading it via MLUtils:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc, "data.libsvm", numFeatures=781).toDF()
Here sc is the Spark context and df is the resulting data frame.

Matrix Multiplication A^T * A in PySpark

I asked a similar question yesterday - Matrix Multiplication between two RDD[Array[Double]] in Spark - however I've decided to shift to pyspark to do this. I've made some progress loading and reformatting the data - Pyspark map from RDD of strings to RDD of list of doubles - however the matrix multiplication is difficult. Let me share my progress first:
matrix1.txt
1.2 3.4 2.3
2.3 1.1 1.5
3.3 1.8 4.5
5.3 2.2 4.5
9.3 8.1 0.3
4.5 4.3 2.1
It's difficult to share files, but this is what my matrix1.txt file looks like: a space-delimited text file containing the values of a matrix. Next is the code:
# do the imports for pyspark and numpy
from pyspark import SparkConf, SparkContext
import numpy as np

# loadmatrix is a helper function used to read matrix1.txt and format
# from RDD of strings to RDD of list of floats
def loadmatrix(sc):
    data = sc.textFile("matrix1.txt").map(lambda line: line.split(' ')).map(lambda line: [float(x) for x in line])
    return(data)

# this is the function I am struggling with, it should take a line of the
# matrix (formatted as list of floats), compute an outer product with itself
def AtransposeA(line):
    # pseudocode for this would be...
    # outerprod = compute line * line^transpose
    # return(outerprod)

# here is the main body of my file
if __name__ == "__main__":
    # create the conf, sc objects, then use loadmatrix to read data
    conf = SparkConf().setAppName('SVD').setMaster('local')
    sc = SparkContext(conf = conf)
    mymatrix = loadmatrix(sc)

    # this is pseudocode for calling AtransposeA
    ATA = mymatrix.map(lambda line: AtransposeA(line)).reduce(elementwise add all the outerproducts)

    # the SVD of ATA is computed below
    U, S, V = np.linalg.svd(ATA)
    # ...
My approach is as follows - to do the matrix multiplication A^T * A, I create a function that computes outer products of rows of A. The elementwise sum of all of the outer products is the product I want. I then call AtransposeA() in a map function, so that it is performed on each row of the matrix, and finally I use a reduce() to add the resulting matrices.
I'm struggling with how the AtransposeA function should look. How can I do an outer product in pyspark like this? Thanks in advance for the help!
First, consider why you want to use Spark for this. It sounds like all your data fits in memory, in which case you can use numpy and pandas in a very straightforward way.
If your data isn't structured so that rows are independent, then it probably can't be parallelized by sending groups of rows to different nodes, which is the whole point of using Spark.
Having said that... here is some pyspark (2.1.1) code that I think does what you want.
# read the matrix file
df = spark.read.csv("matrix1.txt",sep=" ",inferSchema=True)
df.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|1.2|3.4|2.3|
|2.3|1.1|1.5|
|3.3|1.8|4.5|
|5.3|2.2|4.5|
|9.3|8.1|0.3|
|4.5|4.3|2.1|
+---+---+---+
from functools import reduce
from pyspark.sql import functions as F

# do the sum of the multiplication that we want, and get
# one data frame for each column
colDFs = []
for c2 in df.columns:
    colDFs.append( df.select( [ F.sum(df[c1]*df[c2]).alias("op_{0}".format(i)) for i,c1 in enumerate(df.columns) ] ) )

# now union those separate data frames to build the "matrix"
mtxDF = reduce(lambda a,b: a.select(a.columns).union(b.select(a.columns)), colDFs )
mtxDF.show()
+------------------+------------------+------------------+
| op_0| op_1| op_2|
+------------------+------------------+------------------+
| 152.45|118.88999999999999| 57.15|
|118.88999999999999|104.94999999999999| 38.93|
| 57.15| 38.93|52.540000000000006|
+------------------+------------------+------------------+
This seems to be the same result that you get from numpy.
import numpy
a = numpy.genfromtxt("matrix1.txt")
numpy.dot(a.T, a)
array([[152.45, 118.89,  57.15],
       [118.89, 104.95,  38.93],
       [ 57.15,  38.93,  52.54]])
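For completeness, the outer-product-and-reduce approach sketched in the question can also be written directly against the RDD. A minimal numpy-based sketch, assuming the same loadmatrix helper and SparkContext as in the question:
```
import numpy as np

def AtransposeA(line):
    # outer product of one row with itself, as a dense numpy array
    row = np.array(line)
    return np.outer(row, row)

# elementwise-sum the per-row outer products to get A^T * A
ATA = mymatrix.map(AtransposeA).reduce(lambda m1, m2: m1 + m2)

# ATA is now a plain numpy array, so the SVD works as planned
U, S, V = np.linalg.svd(ATA)
```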