How to generate / split data randomly in PySpark

How to generate / split data randomly in PySpark - scala

The following line of Scala code in Apache Spark will split data randomly across 8 partition:
import org.apache.spark.sql.functions.rand
df
.repartition(8, col("person_country"), rand)
.write
.partitionBy("person_country")
.csv(outputPath)
Can someone show me how to do the equivalent with PySpark? I have attempted it myself with the following code, but it fails
from pyspark.sql.functions import rand
df\
.repartition(8, col("person_country"), rand)\
.write.partitionBy("person_country")\
.format('csv').mode('Overwrite')\
.save("outputPath")
Any thoughts?

repartition(8, col("person_country"), rand()) with parenthesis after rand

Related

Create multiple rows of fixed length from a data frame column in Pyspark

My input is a dataframe column in pyspark and it has only one column DETAIL_REC.
detail_df.show()
DETAIL_REC
================================
ABC12345678ABC98765543ABC98762345
detail_df.printSchema()
root
|-- DETAIL_REC: string(nullable =true)
For every 11th char/string it has to be in next row of dataframe for downstream process to consume this.
Expected output Should be multiple rows in dataframe
DETAIL_REC (No spaces lines after each record)
==============
ABC12345678
ABC98765543
ABC98762345

If you have spark 2.4+ version, we can make use of higher order functions to do it like below:
from pyspark.sql import functions as F
n = 11
output = df.withColumn("SubstrCol",F.explode((F.expr(f"""filter(
transform(
sequence(0,length(DETAIL_REC),{n})
,x-> substring(DETAIL_REC,x+1,{n}))
,y->y <> '')"""))))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+
Logic used:
First generate a sequence of integers starting from 0 to length of the string in steps of 11 (n)
Using transform iterate through this sequence and keep getting substrings from the original string (This keeps changing the start position.
Filter out any blank strings from the resulting array and explode this array.
For lower versions of spark, use a udf with textwrap or any other functions as addressed here:
from pyspark.sql import functions as F, types as T
from textwrap import wrap
n = 11
myudf = F.udf(lambda x: wrap(x,n),T.ArrayType(T.StringType()))
output = df.withColumn("SubstrCol",F.explode(myudf("DETAIL_REC")))
output.show(truncate=False)
+---------------------------------+-----------+
|DETAIL_REC |SubstrCol |
+---------------------------------+-----------+
|ABC12345678ABC98765543ABC98762345|ABC12345678|
|ABC12345678ABC98765543ABC98762345|ABC98765543|
|ABC12345678ABC98765543ABC98762345|ABC98762345|
+---------------------------------+-----------+

How can I reuse the dataframe and use alternative for iloc to run an iterative imputer in Azure databricks

I am running an iterative imputer in Jupyter Notebook to first mark the known incorrect values as "Nan" and then run the iterative imputer to impute the correct values to achieve required sharpness in the data. The sample code is given below:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
import numpy as np
import pandas as p
idx = [761, 762, 763, 764]
cols = ['11','12','13','14']
def fit_imputer():
for i in range(len(idx)):
for col in cols:
dfClean.iloc[idx[i], col] = np.nan
print('Index = {} Col = {} Defiled Value is: {}'.format(idx[i], col, dfClean.iloc[idx[i], col]))
# Run Imputer for Individual row
tempOut = imp.fit_transform(dfClean)
print("Imputed Value = ",tempOut[idx[i],col] )
dfClean.iloc[idx[i], col] = tempOut[idx[i],col]
print("new dfClean Value = ",dfClean.iloc[idx[i], col])
origVal.append(dfClean_Orig.iloc[idx[i], col])
I get an error when I try to run this code on Azure Databricks using pyspark or scala. Because the dataframes in spark are immutable also I cannot use iloc as I have used it in pandas dataframe.
Is there a way or better way of implementing such imputation in databricks?

spark.read.format('libsvm') not working with python

I am learning PYSPARK and encountered a problem that I can't fix. I followed this video to copy codes from the PYSPARK documentation to load data for linear regression. The code I got from the documentation was spark.read.format('libsvm').load('file.txt'). I created a spark data frame before this btw. When I run this code in Jupyter notebook it keeps giving me some java error and the guy in this video did the exact same thing as I did and he didn't get this error. Can someone help me resolve this issue, please?
A lot of thanks!

I think I solved this issue by setting the "numFeatures" in the option method:
training = spark.read.format('libsvm').option("numFeatures","10").load('sample_linear_regression_data.txt', header=True)

You can use this custom function to read libsvm file.
from pyspark.sql import Row
from pyspark.ml.linalg import SparseVector
def read_libsvm(filepath, spark_session):
'''
A utility function that takes in a libsvm file and turn it to a pyspark dataframe.
Args:
filepath (str): The file path to the data file.
spark_session (object): The SparkSession object to create dataframe.
Returns:
A pyspark dataframe that contains the data loaded.
'''
with open(filepath, 'r') as f:
raw_data = [x.split() for x in f.readlines()]
outcome = [int(x[0]) for x in raw_data]
index_value_dict = list()
for row in raw_data:
index_value_dict.append(dict([(int(x.split(':')[0]), float(x.split(':')[1]))
for x in row[1:]]))
max_idx = max([max(x.keys()) for x in index_value_dict])
rows = [
Row(
label=outcome[i],
feat_vector=SparseVector(max_idx + 1, index_value_dict[i])
)
for i in range(len(index_value_dict))
]
df = spark_session.createDataFrame(rows)
return df
Usage:
my_data = read_libsvm(filepath="sample_libsvm_data.txt", spark_session=spark)

You can try to load via:
from pyspark.mllib.util import MLUtils
df = MLUtils.loadLibSVMFile(sc,"data.libsvm",numFeatures=781).toDF()
sc is Spark context and df is resulting data frame.

I have an issue with regex extract with multiple matches

I am trying to extract 60 ML and 0.5 ML from the string "60 ML of paracetomol and 0.5 ML of XYZ" . This string is part of a column X in spark dataframe. Though I am able to test my regex code to extract 60 ML and 0.5 ML in regex validator, I am not able to extract it using regexp_extract as it targets only 1st matches. Hence I am getting only 60 ML.
Can you suggest me the best way of doing it using UDF ?

Here is how you can do it with a python UDF:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
data = [('60 ML of paracetomol and 0.5 ML of XYZ',)]
df = sc.parallelize(data).toDF('str:string')
# Define the function you want to return
def extract(s)
all_matches = re.findall(r'\d+(?:.\d+)? ML', s)
return all_matches
# Create the UDF, note that you need to declare the return schema matching the returned type
extract_udf = udf(extract, ArrayType(StringType()))
# Apply it
df2 = df.withColumn('extracted', extract_udf('str'))
Python UDFs take a significant performance hit over native DataFrame operations. After thinking about it a little more, here is another way to do it without using a UDF. The general idea is replace all the text that isn't what you want with commas, then split on comma to create your array of final values. If you only want the numbers you can update the regex's to take 'ML' out of the capture group.
pattern = r'\d+(?:\.\d+)? ML'
split_pattern = r'.*?({pattern})'.format(pattern=pattern)
end_pattern = r'(.*{pattern}).*?$'.format(pattern=pattern)
df2 = df.withColumn('a', regexp_replace('str', split_pattern, '$1,'))
df3 = df2.withColumn('a', regexp_replace('a', end_pattern, '$1'))
df4 = df3.withColumn('a', split('a', r','))

Matrix Multiplication A^T * A in PySpark

I asked a similar question yesterday - Matrix Multiplication between two RDD[Array[Double]] in Spark - however I've decided to shift to pyspark to do this. I've made some progress loading and reformatting the data - Pyspark map from RDD of strings to RDD of list of doubles - however the matrix multiplcation is difficult. Let me share my progress first:
matrix1.txt
1.2 3.4 2.3
2.3 1.1 1.5
3.3 1.8 4.5
5.3 2.2 4.5
9.3 8.1 0.3
4.5 4.3 2.1
it's difficult to share files, however this is what my matrix1.txt file looks like. It is a space-delimited text file including the values of a matrix. Next is the code:
# do the imports for pyspark and numpy
from pyspark import SparkConf, SparkContext
import numpy as np
# loadmatrix is a helper function used to read matrix1.txt and format
# from RDD of strings to RDD of list of floats
def loadmatrix(sc):
data = sc.textFile("matrix1.txt").map(lambda line: line.split(' ')).map(lambda line: [float(x) for x in line])
return(data)
# this is the function I am struggling with, it should take a line of the
# matrix (formatted as list of floats), compute an outer product with itself
def AtransposeA(line):
# pseudocode for this would be...
# outerprod = compute line * line^transpose
# return(outerprod)
# here is the main body of my file
if __name__ == "__main__":
# create the conf, sc objects, then use loadmatrix to read data
conf = SparkConf().setAppName('SVD').setMaster('local')
sc = SparkContext(conf = conf)
mymatrix = loadmatrix(sc)
# this is pseudocode for calling AtransposeA
ATA = mymatrix.map(lambda line: AtransposeA(line)).reduce(elementwise add all the outerproducts)
# the SVD of ATA is computed below
U, S, V = np.linalg.svd(ATA)
# ...
My approach is as follows - to do matrix multiplication A^T * A, I create a function that computes outer products of rows of A. The elementwise sum of all of the outerproducts is the product I want. I then call AtransposeA() in a map function, that way is it performed on each row of the matrix, and finally I use a reduce() to add the resulting matrices.
I'm struggling thinking about how the AtransposeA function should look. How can I do an outerproduct in pyspark like this? Thanks in advance for help!

First, consider why you want to use Spark for this. It sounds like all your data fits in memory, in which case you can use numpy and pandas in a very straight-forward way.
If your data isn't structured so that rows are independent, then it probably can't be parallelized by sending groups of rows to different nodes, which is the whole point of using Spark.
Having said that... here is some pyspark (2.1.1) code that I think does what you want.
# read the matrix file
df = spark.read.csv("matrix1.txt",sep=" ",inferSchema=True)
df.show()
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|1.2|3.4|2.3|
|2.3|1.1|1.5|
|3.3|1.8|4.5|
|5.3|2.2|4.5|
|9.3|8.1|0.3|
|4.5|4.3|2.1|
+---+---+---+
# do the sum of the multiplication that we want, and get
# one data frame for each column
colDFs = []
for c2 in df.columns:
colDFs.append( df.select( [ F.sum(df[c1]*df[c2]).alias("op_{0}".format(i)) for i,c1 in enumerate(df.columns) ] ) )
# now union those separate data frames to build the "matrix"
mtxDF = reduce(lambda a,b: a.select(a.columns).union(b.select(a.columns)), colDFs )
mtxDF.show()
+------------------+------------------+------------------+
| op_0| op_1| op_2|
+------------------+------------------+------------------+
| 152.45|118.88999999999999| 57.15|
|118.88999999999999|104.94999999999999| 38.93|
| 57.15| 38.93|52.540000000000006|
+------------------+------------------+------------------+
This seems to be the same result that you get from numpy.
a = numpy.genfromtxt("matrix1.txt")
numpy.dot(a.T, a)
array([[ 152.45, 118.89, 57.15],
[ 118.89, 104.95, 38.93],
[ 57.15, 38.93, 52.54]])

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to generate / split data randomly in PySpark - scala

repartition(8, col("person_country"), rand()) with parenthesis after rand

Related

Create multiple rows of fixed length from a data frame column in Pyspark

How can I reuse the dataframe and use alternative for iloc to run an iterative imputer in Azure databricks

spark.read.format('libsvm') not working with python

I have an issue with regex extract with multiple matches

Matrix Multiplication A^T * A in PySpark

Categories

Resources